# VR180 Human Matting Proof of Concept - Det-SAM2 Approach

## Project Overview

A proof-of-concept implementation to test the feasibility of using Det-SAM2 for automated human matting on VR180 3D side-by-side equirectangular video. The system processes a 30-second test clip to evaluate quality, performance, and resource requirements on local RTX 3080 hardware, with design considerations for cloud GPU scaling.

## Input Specifications

- **Format**: VR180 3D side-by-side equirectangular video
- **Resolution**: 6144x3072 (3072x3072 per eye)
- **Test Duration**: 30 seconds
- **Layout**: Left eye (columns 0-3071), right eye (columns 3072-6143)

## Core Functionality

### Automatic Person Detection
- **Method**: YOLOv8 integration with Det-SAM2
- **Detection**: Automatic bounding box placement on all humans
- **Manual Input**: None required - fully automated pipeline with no point selection

### Processing Strategy
- **Primary Approach**: Process both eyes using disparity mapping optimization
- **Fallback**: Independent processing per eye if disparity mapping proves complex
- **Chunking**: Adaptive segmentation (full 30s clip preferred, with fallback to smaller chunks if VRAM is limited)

### Scaling and Quality Options
- **Resolution Scaling**: 25%, 50%, or 100% processing resolution
- **Mask Upscaling**: AI-based upscaling to full resolution for final output
- **Quality vs Performance**: Configurable tradeoffs for local vs cloud processing

## Configuration System

### YAML/TOML Configuration File

```yaml
input:
  video_path: "path/to/input.mp4"

processing:
  scale_factor: 0.5    # 0.25, 0.5, 1.0
  chunk_size: 900      # frames, 0 for full video
  overlap_frames: 60   # for chunked processing

detection:
  confidence_threshold: 0.7
  model: "yolov8n"     # yolov8n, yolov8s, yolov8m

matting:
  use_disparity_mapping: true
  memory_offload: true
  fp16: true

output:
  path: "path/to/output/"
  format: "alpha"                # "alpha" or "greenscreen"
  background_color: [0, 255, 0]  # for greenscreen
  maintain_sbs: true             # keep side-by-side format

hardware:
  device: "cuda"
  max_vram_gb: 10      # RTX 3080 limit
```

## Technical Implementation

### Memory Optimization (Det-SAM2 Enhancements)
- **CPU Offloading**: `offload_video_to_cpu=True`
- **FP16 Storage**: Reduces memory usage by ~50%
- **Frame Release**: `release_old_frames()` for constant VRAM usage
- **Adaptive Chunking**: Automatic chunk size based on available VRAM

### VR180-Specific Optimizations
- **Stereo Processing**: Leverage disparity mapping for efficiency
- **Cross-Eye Validation**: Ensure consistency between left/right views
- **Edge Refinement**: Multi-resolution processing for clean matting boundaries

### Output Options
- **Alpha Channel**: Transparent PNG sequence or video with alpha
- **Green Screen**: Configurable background color for traditional keying
- **Format Preservation**: Maintain the original SBS layout or output separate eyes (split/join sketched below)
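Since the eyes are packed side by side, splitting and recombining reduce to column slicing. A minimal sketch, assuming frames arrive as H x W x 3 NumPy arrays (`split_sbs` and `join_sbs` are hypothetical helpers, not part of Det-SAM2):

```python
import numpy as np

EYE_WIDTH = 3072  # per-eye width at full resolution

def split_sbs(frame: np.ndarray) -> tuple[np.ndarray, np.ndarray]:
    """Split a VR180 SBS frame (H x 6144 x 3) into left/right eye views."""
    assert frame.shape[1] == 2 * EYE_WIDTH, "unexpected SBS width"
    left = frame[:, :EYE_WIDTH]    # columns 0-3071
    right = frame[:, EYE_WIDTH:]   # columns 3072-6143
    return left, right

def join_sbs(left: np.ndarray, right: np.ndarray) -> np.ndarray:
    """Recombine per-eye frames into the original SBS layout."""
    return np.concatenate([left, right], axis=1)
```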
## Performance Targets

### Local RTX 3080 (10GB VRAM)
- **25% Scale**: ~5-8 FPS processing, ~6 minutes for a 30s clip
- **50% Scale**: ~3-5 FPS processing, ~10 minutes for a 30s clip
- **100% Scale**: Chunked processing required, ~15-20 minutes for a 30s clip

### Cloud GPU Scaling (Future)
- **Design Considerations**: Architecture ready for Docker containerization
- **Provider Agnostic**: Compatible with RunPod, Vast.ai, etc.
- **Batch Processing**: Queue-based job distribution
- **Cost Estimation**: Target $0.10-0.50 per 30s clip

## Quality Assessment Features

### Automated Quality Metrics
- **Edge Consistency**: Measure aliasing and stair-stepping
- **Temporal Stability**: Frame-to-frame consistency scoring (see the sketch after this section)
- **Stereo Alignment**: Left/right eye correspondence validation

### Debug/Analysis Outputs
- **Detection Visualization**: Bounding boxes overlaid on frames
- **Confidence Maps**: Per-pixel matting confidence scores
- **Processing Stats**: VRAM usage, FPS, chunk information
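One simple way to score temporal stability is frame-to-frame mask IoU, assuming binary person masks per frame; the metric choice here is an illustration, not the project's committed implementation:

```python
import numpy as np

def temporal_stability(prev_mask: np.ndarray, cur_mask: np.ndarray) -> float:
    """Frame-to-frame consistency as mask IoU: 1.0 means identical coverage,
    lower values indicate flicker between consecutive frames."""
    prev_b, cur_b = prev_mask > 0, cur_mask > 0
    intersection = np.logical_and(prev_b, cur_b).sum()
    union = np.logical_or(prev_b, cur_b).sum()
    return float(intersection / union) if union else 1.0
```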
## Deliverables

### Phase 1: Core Implementation
1. **Det-SAM2 Integration**: Automatic detection pipeline
2. **VRAM Optimization**: Memory management for RTX 3080
3. **Basic Matting**: Single-resolution processing
4. **Configuration System**: YAML-based parameter control

### Phase 2: VR180 Optimization
1. **Disparity Processing**: Stereo-aware matting
2. **Multi-Resolution**: Scaling and upsampling pipeline
3. **Quality Assessment**: Automated metrics and visualization
4. **Edge Refinement**: Anti-aliasing and boundary smoothing

### Phase 3: Production Ready
1. **Cloud GPU Support**: Docker containerization
2. **Batch Processing**: Multiple video queue system
3. **Performance Profiling**: Detailed resource usage analytics
4. **Quality Validation**: Comprehensive testing suite

## Post-Implementation Optimization Opportunities

*Based on results from the first successful 30-second test clip execution (A40 GPU, 50% scale, 9 chunks of 200 frames)*

### Performance Analysis Findings
- **Processing Speed**: ~0.54s per frame (64.4s per 120-frame chunk)
- **VRAM Utilization**: Only 2.5% (1.11GB of 45GB available) - significantly underutilized
- **RAM Usage**: 106GB used of 494GB available (21.5%)
- **Primary Bottleneck**: Intermediate ffmpeg encoding operations per chunk

### Identified Optimization Categories

#### Category A: Performance Improvements (Quick Wins)
1. **Audio Track Preservation** ⚠️ **CRITICAL**
   - Issue: Output video is missing the audio track from the input
   - Solution: Use ffmpeg to copy the audio stream during final video creation
   - Implementation: Add `-c:a copy` to the final ffmpeg command
   - Impact: Essential for production usability
   - Risk: Low, standard ffmpeg operation
2. **Frame Count Synchronization** ⚠️ **CRITICAL**
   - Issue: Audio sync drift if input/output frame counts differ
   - Solution: Validate exact frame count preservation throughout the pipeline
   - Implementation: Frame count verification plus duration matching (see the ffprobe sketch after this list)
   - Impact: Prevents audio desync in long videos
   - Risk: Low, validation feature
3. **Memory Usage Reality Check** ⚠️ **IMPORTANT**
   - Current assumption: Unlimited RAM for the memory-only pipeline
   - Reality: RunPod containers are limited to ~48GB RAM
   - Risk calculation: 1-hour video = ~213k frames = potential 20-40GB+ memory usage
   - Solution: Implement streaming output instead of full in-memory accumulation
   - Impact: Enables processing of long-form content
   - Risk: Medium, requires pipeline restructuring
4. **Larger Chunk Sizes**
   - Current: 200 frames per chunk (conservative for the 10GB RTX 3080)
   - Opportunity: 600-800 frames per chunk on high-VRAM systems
   - Impact: Reduce 9 chunks to 2-3 chunks, fewer intermediate operations
   - Risk: Low, easily configurable
5. **Streaming Output Pipeline**
   - Current: Accumulate all processed frames in memory, write once
   - Opportunity: Write processed chunks to temporary segments, merge at the end
   - Impact: Constant memory usage regardless of video length
   - Risk: Medium, requires temporary file management
6. **Enhanced Performance Profiling**
   - Current: Basic memory monitoring
   - Opportunity: Detailed timing per processing stage (detection, propagation, encoding)
   - Impact: Identify exact bottlenecks for targeted optimization
   - Risk: Low, debugging feature
7. **Parallel Eye Processing**
   - Current: Sequential left eye → right eye processing
   - Opportunity: Process both eyes simultaneously
   - Impact: Potential 50% speedup, better GPU utilization
   - Risk: Medium, memory management complexity
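A sketch of the frame-count check from item 2, using standard ffprobe flags; `count_frames` and `verify_frame_counts` are hypothetical helper names:

```python
import json
import subprocess

def count_frames(video_path: str) -> int:
    """Count video frames with ffprobe (decodes the stream: exact but slow)."""
    out = subprocess.run(
        ["ffprobe", "-v", "error", "-select_streams", "v:0",
         "-count_frames", "-show_entries", "stream=nb_read_frames",
         "-of", "json", video_path],
        capture_output=True, text=True, check=True,
    ).stdout
    return int(json.loads(out)["streams"][0]["nb_read_frames"])

def verify_frame_counts(input_path: str, output_path: str) -> None:
    """Fail loudly if the pipeline dropped or duplicated frames."""
    n_in, n_out = count_frames(input_path), count_frames(output_path)
    assert n_in == n_out, f"frame count drift: {n_in} in vs {n_out} out"
```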
#### Category B: Stereo Consistency Fixes (Critical for VR)
1. **Master-Slave Eye Processing**
   - Issue: Independent detection leads to mismatched person counts between eyes
   - Solution: Use left-eye detections as "seeds" for right-eye processing
   - Impact: Ensures identical person detection across the stereo pair
   - Risk: Low, maintains current quality while improving consistency
2. **Cross-Eye Detection Validation**
   - Issue: Hair/clothing included in one eye but not the other
   - Solution: Compare detection results, flag inconsistencies for reprocessing
   - Impact: 90%+ stereo alignment improvement
   - Risk: Low, falls back to current behavior
3. **Disparity-Aware Segmentation**
   - Issue: Segmentation boundaries differ between eyes despite the same person
   - Solution: Use stereo disparity to correlate features between eyes
   - Impact: True stereo-consistent matting
   - Risk: High, complex implementation
4. **Joint Stereo Detection**
   - Issue: YOLO runs independently on each eye
   - Solution: Run YOLO on the full SBS frame, split detections spatially (see the coordinate-mapping sketch below)
   - Impact: Guaranteed identical detection counts
   - Risk: Medium, requires detection coordinate mapping

#### Category C: Advanced Optimizations (Future)
1. **Adaptive Memory Management**
   - Opportunity: Dynamic chunk sizing based on real-time VRAM usage
   - Impact: Optimal resource utilization across different hardware
   - Risk: Medium, complex heuristics
2. **Multi-Resolution Processing**
   - Opportunity: Initial processing at lower resolution, edge refinement at full resolution
   - Impact: Speed improvement while maintaining quality
   - Risk: Medium, quality validation required
3. **Enhanced Workflow Documentation**
   - Issue: Unclear intermediate data lifecycle
   - Solution: Detailed logging of chunk processing, optional intermediate preservation
   - Impact: Better debugging and user understanding
   - Risk: Low, documentation feature

### Implementation Strategy
- **Phase A**: Quick performance wins (larger chunks, profiling)
- **Phase B**: Stereo consistency (master-slave, validation)
- **Phase C**: Advanced features (disparity-aware, memory optimization)

### Configuration Extensions Required

```yaml
processing:
  chunk_size: 600          # Increase from 200 for high-VRAM systems
  memory_pipeline: false   # Skip intermediate video creation (disabled due to RAM limits)
  streaming_output: true   # Write chunks progressively instead of accumulating
  parallel_eyes: false     # Process eyes simultaneously
  max_memory_gb: 40        # Realistic RAM limit for RunPod containers

audio:
  preserve_audio: true     # Copy audio track from input to output
  verify_sync: true        # Validate frame count and duration matching
  audio_codec: "copy"      # Preserve original audio codec

stereo:
  consistency_mode: "master_slave"  # "independent", "master_slave", "joint"
  validation_threshold: 0.8         # Similarity threshold between eyes
  correction_method: "transfer"     # "transfer", "reprocess", "ensemble"

performance:
  profile_enabled: true          # Detailed timing analysis
  preserve_intermediates: false  # For debugging workflow

debugging:
  log_intermediate_workflow: true      # Document chunk lifecycle
  save_detection_visualization: false  # Debug detection mismatches
  frame_count_validation: true         # Ensure exact frame preservation
```

### Technical Implementation Details

#### Audio Preservation Implementation

```python
# During the final video save, include an audio stream copy.
# Assumes fps, frame_pattern, input_video_path, and output_path are defined.
ffmpeg_cmd = [
    'ffmpeg', '-y',
    '-framerate', str(fps),
    '-i', frame_pattern,      # Video frames
    '-i', input_video_path,   # Original video for audio
    '-c:v', 'h264_nvenc',     # GPU video codec (with CPU fallback)
    '-c:a', 'copy',           # Copy audio without re-encoding
    '-map', '0:v:0',          # Map video from first input
    '-map', '1:a:0',          # Map audio from second input
    '-shortest',              # Match shortest stream duration
    output_path,
]
```

#### Streaming Output Implementation

```python
# Instead of accumulating all frames in memory, write chunks as they finish:
class StreamingVideoWriter:
    def __init__(self, output_path, fps, audio_source):
        self.output_path = output_path
        self.fps = fps
        self.audio_source = audio_source  # original video supplying the audio track
        self.temp_segments = []
        self.current_segment = 0

    def write_chunk(self, processed_frames):
        # Write this chunk to a temporary segment file
        segment_path = f"temp_segment_{self.current_segment}.mp4"
        self.write_video_segment(processed_frames, segment_path)
        self.temp_segments.append(segment_path)
        self.current_segment += 1

    def finalize(self):
        # Merge all segments and mux in audio from the source video.
        # (write_video_segment / merge_segments_with_audio wrap the ffmpeg
        # calls shown above; omitted here for brevity.)
        self.merge_segments_with_audio()
```

#### Memory Usage Calculation

```python
def estimate_memory_requirements(duration_seconds, fps, resolution_scale=0.5):
    """Estimate peak RAM for holding all processed frames in memory."""
    frames = duration_seconds * fps
    width = int(6144 * resolution_scale)   # full SBS width at processing scale
    height = int(3072 * resolution_scale)  # full SBS height at processing scale
    # float32 RGB: width * height * 3 channels * 4 bytes (~54MB at 50% scale)
    frame_size_mb = (width * height * 3 * 4) / (1024 * 1024)
    total_memory_gb = (frames * frame_size_mb) / 1024
    return {
        'duration': duration_seconds,
        'total_frames': frames,
        'estimated_memory_gb': total_memory_gb,
        'safe_for_48gb': total_memory_gb < 40,
    }

# Example outputs (assuming 30 fps, single accumulation stream):
# 30 seconds: ~47GB  (already exceeds the 40GB budget)
# 5 minutes:  ~475GB (requires streaming)
# 1 hour:     ~5.6TB (requires streaming)
```
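#### Joint Stereo Detection Sketch

For Category B's joint stereo detection idea, one possible coordinate mapping is sketched below; `split_sbs_detections` is a hypothetical helper, and boxes are assumed to be pixel coordinates on the 6144-wide SBS frame:

```python
EYE_WIDTH = 3072  # per-eye width at full resolution

def split_sbs_detections(boxes):
    """Assign full-SBS-frame detections (x1, y1, x2, y2) to each eye.

    A person visible in both eyes yields two boxes, one per half;
    each is converted to that eye's local coordinate system.
    """
    left_eye, right_eye = [], []
    for x1, y1, x2, y2 in boxes:
        center_x = (x1 + x2) / 2
        if center_x < EYE_WIDTH:
            # Left half: coordinates are already local; clip at the seam.
            left_eye.append((x1, y1, min(x2, EYE_WIDTH - 1), y2))
        else:
            # Right half: shift into right-eye local coordinates, then clip.
            right_eye.append((max(x1 - EYE_WIDTH, 0), y1, x2 - EYE_WIDTH, y2))
    return left_eye, right_eye
```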
## Success Criteria

### Technical Feasibility
- [ ] Process a 30s VR180 clip without manual intervention
- [ ] Maintain <10GB VRAM usage on RTX 3080
- [ ] Achieve acceptable matting quality at 50% scale
- [ ] Complete processing in <15 minutes locally

### Quality Benchmarks
- [ ] Clean edges with minimal artifacts
- [ ] Temporal consistency across frames
- [ ] Stereo alignment between left/right eyes
- [ ] Usable results for green screen compositing

### Scalability Validation
- [ ] Configuration-driven parameter control
- [ ] Clear performance vs quality tradeoffs identified
- [ ] Docker deployment pathway established
- [ ] Cost/benefit analysis for cloud GPU usage

## Risk Mitigation

### VRAM Limitations
- **Fallback**: Automatic chunking with overlap processing
- **Monitoring**: Real-time VRAM usage tracking
- **Graceful Degradation**: Quality reduction before failure

### Quality Issues
- **Validation Pipeline**: Automated quality assessment
- **Manual Override**: Optional bounding box adjustment
- **Fallback Methods**: Integration points for RVM if needed

### Performance Bottlenecks
- **Profiling**: Detailed timing analysis per component (minimal sketch below)
- **Optimization**: Identify CPU vs GPU bound operations
- **Scaling Strategy**: Clear upgrade path to cloud GPUs
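As a starting point for the per-component timing analysis, a minimal sketch; the `timed` context manager is a hypothetical helper and the stage names are illustrative:

```python
import time
from collections import defaultdict
from contextlib import contextmanager

# Accumulated wall-clock time per pipeline stage.
stage_times = defaultdict(float)

@contextmanager
def timed(stage: str):
    """Accumulate wall-clock time spent in a named pipeline stage."""
    start = time.perf_counter()
    try:
        yield
    finally:
        stage_times[stage] += time.perf_counter() - start

# Usage sketch: wrap each stage to see where chunks spend their time.
# with timed("detection"):
#     boxes = detector(frame)
# with timed("propagation"):
#     masks = propagate_masks(boxes)
# with timed("encoding"):
#     writer.write_chunk(masks)
```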