VR180 Human Matting Proof of Concept - Det-SAM2 Approach
Project Overview
A proof-of-concept implementation to test the feasibility of using Det-SAM2 for automated human matting on VR180 3D side-by-side equirectangular video. The system will process a 30-second test clip to evaluate quality, performance, and resource requirements on local RTX 3080 hardware, with design considerations for cloud GPU scaling.
Input Specifications
- Format: VR180 3D side-by-side equirectangular video
- Resolution: 6144x3072 (3072x3072 per eye)
- Test Duration: 30 seconds
- Layout: Left eye (0-3071px), Right eye (3072-6143px)
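For reference, a minimal sketch of splitting this layout into per-eye views (the file name is illustrative):

```python
# Split one SBS frame into per-eye views with OpenCV/NumPy.
import cv2

frame = cv2.imread("sbs_frame.png")  # 6144x3072 full SBS frame
left_eye = frame[:, :3072]           # columns 0-3071
right_eye = frame[:, 3072:]          # columns 3072-6143
```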
Core Functionality
Automatic Person Detection
- Method: YOLOv8 integration with Det-SAM2
- Detection: Automatic bounding box placement on all humans
- Minimal Manual Input: Fully automated pipeline with no point selection required
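A detection step along these lines, sketched against the Ultralytics YOLOv8 API (the input file name is illustrative; class 0 is "person" in the COCO label set):

```python
from ultralytics import YOLO
import cv2

frame = cv2.imread("left_eye_frame.png")            # illustrative input
model = YOLO("yolov8n.pt")
results = model(frame, classes=[0], conf=0.7)       # detect people only
person_boxes = results[0].boxes.xyxy.cpu().numpy()  # (N, 4) x1,y1,x2,y2
# Each box then seeds Det-SAM2 mask propagation for that person.
```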
Processing Strategy
- Primary Approach: Process both eyes using disparity mapping optimization
- Fallback: Independent processing per eye if disparity mapping proves complex
- Chunking: Adaptive segmentation (full 30s clip preferred, fallback to smaller chunks if VRAM limited)
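The chunking fallback could compute overlapping frame ranges like this (a sketch; the helper name is ours):

```python
def make_chunks(total_frames, chunk_size, overlap):
    """Split a frame range into fixed-size chunks with overlap for blending."""
    if chunk_size <= 0:  # 0 means "process the full video in one pass"
        return [(0, total_frames)]
    chunks, start = [], 0
    while start < total_frames:
        end = min(start + chunk_size, total_frames)
        chunks.append((start, end))
        start = end - overlap if end < total_frames else end
    return chunks

# make_chunks(900, 200, 60) -> [(0, 200), (140, 340), (280, 480), ...]
```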
Scaling and Quality Options
- Resolution Scaling: 25%, 50%, or 100% processing resolution
- Mask Upscaling: AI-based upscaling to full resolution for final output
- Quality vs Performance: Configurable tradeoffs for local vs cloud processing
Configuration System
YAML/TOML Configuration File
```yaml
input:
  video_path: "path/to/input.mp4"

processing:
  scale_factor: 0.5       # 0.25, 0.5, 1.0
  chunk_size: 900         # frames, 0 for full video
  overlap_frames: 60      # for chunked processing

detection:
  confidence_threshold: 0.7
  model: "yolov8n"        # yolov8n, yolov8s, yolov8m

matting:
  use_disparity_mapping: true
  memory_offload: true
  fp16: true

output:
  path: "path/to/output/"
  format: "alpha"               # "alpha" or "greenscreen"
  background_color: [0, 255, 0] # for greenscreen
  maintain_sbs: true            # keep side-by-side format

hardware:
  device: "cuda"
  max_vram_gb: 10         # RTX 3080 limit
```
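Loading the file is straightforward with PyYAML (the config file name is illustrative):

```python
import yaml

with open("config.yaml") as f:
    cfg = yaml.safe_load(f)

scale = cfg["processing"]["scale_factor"]  # e.g. 0.5
```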
Technical Implementation
Memory Optimization (Det-SAM2 Enhancements)
- CPU Offloading: `offload_video_to_cpu=True` keeps decoded frames in CPU RAM
- FP16 Storage: Reduce memory usage by ~50%
- Frame Release: `release_old_frames()` for constant VRAM usage
- Adaptive Chunking: Automatic chunk size based on available VRAM
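A sketch of how these knobs fit together, assuming the SAM2 video-predictor API; `release_old_frames()` is the Det-SAM2 extension named above, not stock SAM2, and the checkpoint and box values are illustrative:

```python
import torch
from sam2.build_sam import build_sam2_video_predictor

predictor = build_sam2_video_predictor("sam2_hiera_s.yaml", "sam2_hiera_small.pt")
state = predictor.init_state(
    "chunk_000.mp4",
    offload_video_to_cpu=True,   # keep decoded frames in CPU RAM
    offload_state_to_cpu=True,   # keep inference state off the GPU
)
# Seed tracking with a detection box from YOLO (illustrative coordinates)
predictor.add_new_points_or_box(state, frame_idx=0, obj_id=1, box=(100, 100, 400, 900))

with torch.autocast("cuda", dtype=torch.float16):  # FP16 compute
    for frame_idx, obj_ids, masks in predictor.propagate_in_video(state):
        pass  # consume masks; Det-SAM2's release_old_frames() would run here
```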
VR180-Specific Optimizations
- Stereo Processing: Leverage disparity mapping for efficiency
- Cross-Eye Validation: Ensure consistency between left/right views
- Edge Refinement: Multi-resolution processing for clean matting boundaries
Output Options
- Alpha Channel: Transparent PNG sequence or video with alpha
- Green Screen: Configurable background color for traditional keying
- Format Preservation: Maintain original SBS layout or output separate eyes
Performance Targets
Local RTX 3080 (10GB VRAM)
- 25% Scale: ~5-8 FPS processing, ~6 minutes for 30s clip
- 50% Scale: ~3-5 FPS processing, ~10 minutes for 30s clip
- 100% Scale: Chunked processing required, ~15-20 minutes for 30s clip
Cloud GPU Scaling (Future)
- Design Considerations: Docker containerization ready
- Provider Agnostic: Compatible with RunPod, Vast.ai, etc.
- Batch Processing: Queue-based job distribution
- Cost Estimation: Target $0.10-0.50 per 30s clip processing
Quality Assessment Features
Automated Quality Metrics
- Edge Consistency: Measure aliasing and stair-stepping
- Temporal Stability: Frame-to-frame consistency scoring
- Stereo Alignment: Left/right eye correspondence validation
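As a concrete example, the temporal-stability score above could be the mean IoU between consecutive binary masks (a sketch; the function name is ours):

```python
import numpy as np

def temporal_stability(masks):
    """Mean IoU between consecutive boolean masks; 1.0 = perfectly stable."""
    ious = []
    for a, b in zip(masks[:-1], masks[1:]):
        union = np.logical_or(a, b).sum()
        inter = np.logical_and(a, b).sum()
        ious.append(inter / union if union else 1.0)
    return float(np.mean(ious))
```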
Debug/Analysis Outputs
- Detection Visualization: Bounding boxes overlaid on frames
- Confidence Maps: Per-pixel matting confidence scores
- Processing Stats: VRAM usage, FPS, chunk information
Deliverables
Phase 1: Core Implementation
- Det-SAM2 Integration: Automatic detection pipeline
- VRAM Optimization: Memory management for RTX 3080
- Basic Matting: Single-resolution processing
- Configuration System: YAML-based parameter control
Phase 2: VR180 Optimization
- Disparity Processing: Stereo-aware matting
- Multi-Resolution: Scaling and upsampling pipeline
- Quality Assessment: Automated metrics and visualization
- Edge Refinement: Anti-aliasing and boundary smoothing
Phase 3: Production Ready
- Cloud GPU Support: Docker containerization
- Batch Processing: Multiple video queue system
- Performance Profiling: Detailed resource usage analytics
- Quality Validation: Comprehensive testing suite
Post-Implementation Optimization Opportunities
Based on the first successful run of the 30-second test clip (A40 GPU, 50% scale, 9 × 200-frame chunks)
Performance Analysis Findings
- Processing Speed: ~0.54s per frame (64.4s for a 120-frame chunk)
- VRAM Utilization: Only 2.5% (1.11GB of 45GB available) - significantly underutilized
- RAM Usage: 106GB used of 494GB available (21.5%)
- Primary Bottleneck: Intermediate ffmpeg encoding operations per chunk
Identified Optimization Categories
Category A: Performance Improvements (Quick Wins)
- Audio Track Preservation ⚠️ CRITICAL
  - Issue: Output video is missing the audio track from the input
  - Solution: Use ffmpeg to copy the audio stream during final video creation
  - Implementation: Add `-c:a copy` to the final ffmpeg command
  - Impact: Essential for production usability
  - Risk: Low, standard ffmpeg operation
- Frame Count Synchronization ⚠️ CRITICAL
  - Issue: Audio sync drifts if input/output frame counts differ
  - Solution: Validate exact frame count preservation throughout the pipeline
  - Implementation: Frame count verification plus duration matching
  - Impact: Prevents audio desync in long videos
  - Risk: Low, validation feature
- Memory Usage Reality Check ⚠️ IMPORTANT
  - Current assumption: Unlimited RAM for a memory-only pipeline
  - Reality: RunPod containers are limited to ~48GB RAM
  - Risk calculation: 1-hour video = ~213k frames = potential 20-40GB+ memory usage
  - Solution: Implement streaming output instead of full in-memory accumulation
  - Impact: Enables processing of long-form content
  - Risk: Medium, requires pipeline restructuring
- Larger Chunk Sizes
  - Current: 200 frames per chunk (conservative for the 10GB RTX 3080)
  - Opportunity: 600-800 frames per chunk on high-VRAM systems
  - Impact: Reduces 9 chunks to 2-3 chunks, fewer intermediate operations
  - Risk: Low, easily configurable
- Streaming Output Pipeline
  - Current: Accumulate all processed frames in memory, write once
  - Opportunity: Write processed chunks to temporary segments, merge at the end
  - Impact: Constant memory usage regardless of video length
  - Risk: Medium, requires temporary file management
- Enhanced Performance Profiling (see the timer sketch after this list)
  - Current: Basic memory monitoring
  - Opportunity: Detailed timing per processing stage (detection, propagation, encoding)
  - Impact: Identifies exact bottlenecks for targeted optimization
  - Risk: Low, debugging feature
- Parallel Eye Processing
  - Current: Sequential left eye → right eye processing
  - Opportunity: Process both eyes simultaneously
  - Impact: Potential 50% speedup, better GPU utilization
  - Risk: Medium, memory management complexity
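The profiling item above could start as small as a per-stage wall-clock timer (a sketch; names are ours):

```python
import time
from contextlib import contextmanager

@contextmanager
def stage_timer(name, stats):
    """Accumulate wall-clock seconds per named pipeline stage."""
    start = time.perf_counter()
    try:
        yield
    finally:
        stats[name] = stats.get(name, 0.0) + time.perf_counter() - start

stats = {}
with stage_timer("detection", stats):
    pass  # run YOLO here
with stage_timer("propagation", stats):
    pass  # run SAM2 propagation here
print(stats)  # per-stage totals for the chunk
```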
Category B: Stereo Consistency Fixes (Critical for VR)
- Master-Slave Eye Processing
  - Issue: Independent detection leads to mismatched person counts between eyes
  - Solution: Use left-eye detections as "seeds" for right-eye processing
  - Impact: Ensures identical person detection across the stereo pair
  - Risk: Low, maintains current quality while improving consistency
- Cross-Eye Detection Validation
  - Issue: Hair/clothing included in one eye but not the other
  - Solution: Compare detection results, flag inconsistencies for reprocessing
  - Impact: 90%+ stereo alignment improvement
  - Risk: Low, falls back to current behavior
- Disparity-Aware Segmentation
  - Issue: Segmentation boundaries differ between eyes despite the same person
  - Solution: Use stereo disparity to correlate features between eyes
  - Impact: True stereo-consistent matting
  - Risk: High, complex implementation
- Joint Stereo Detection (see the sketch after this list)
  - Issue: YOLO runs independently on each eye
  - Solution: Run YOLO on the full SBS frame, split detections spatially
  - Impact: Guaranteed identical detection counts
  - Risk: Medium, requires detection coordinate mapping
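A sketch of the joint-detection option, assuming the Ultralytics API; boxes are assigned to an eye by the half their center falls in, then shifted into per-eye coordinates:

```python
from ultralytics import YOLO

def joint_stereo_detect(sbs_frame, eye_width, conf=0.7):
    """Run YOLO once on the full SBS frame and split person boxes per eye."""
    model = YOLO("yolov8n.pt")
    boxes = model(sbs_frame, classes=[0], conf=conf)[0].boxes.xyxy.tolist()
    left, right = [], []
    for x1, y1, x2, y2 in boxes:
        if (x1 + x2) / 2 < eye_width:       # center in the left half
            left.append((x1, y1, min(x2, eye_width), y2))
        else:                               # shift into right-eye coordinates
            right.append((max(x1 - eye_width, 0), y1, x2 - eye_width, y2))
    return left, right
```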
Category C: Advanced Optimizations (Future)
- Adaptive Memory Management (see the chunk-sizing sketch after this list)
  - Opportunity: Dynamic chunk sizing based on real-time VRAM usage
  - Impact: Optimal resource utilization across different hardware
  - Risk: Medium, complex heuristics
- Multi-Resolution Processing
  - Opportunity: Initial processing at lower resolution, edge refinement at full resolution
  - Impact: Speed improvement while maintaining quality
  - Risk: Medium, quality validation required
- Enhanced Workflow Documentation
  - Issue: Unclear intermediate data lifecycle
  - Solution: Detailed logging of chunk processing, optional intermediate preservation
  - Impact: Better debugging and user understanding
  - Risk: Low, documentation feature
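Adaptive chunk sizing could key off the free VRAM reported by PyTorch (a sketch; the frames-per-GB heuristic is an assumption to be calibrated):

```python
import torch

def adaptive_chunk_size(frames_per_gb=60, reserve_gb=2, max_frames=800):
    """Pick a chunk size from currently free VRAM."""
    free_bytes, _total = torch.cuda.mem_get_info()
    usable_gb = free_bytes / 1024**3 - reserve_gb
    return max(50, min(max_frames, int(usable_gb * frames_per_gb)))
```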
Implementation Strategy
- Phase A: Quick performance wins (larger chunks, profiling)
- Phase B: Stereo consistency (master-slave, validation)
- Phase C: Advanced features (disparity-aware, memory optimization)
Configuration Extensions Required
```yaml
processing:
  chunk_size: 600          # Increase from 200 for high-VRAM systems
  memory_pipeline: false   # Skip intermediate video creation (disabled due to RAM limits)
  streaming_output: true   # Write chunks progressively instead of accumulating
  parallel_eyes: false     # Process eyes simultaneously
  max_memory_gb: 40        # Realistic RAM limit for RunPod containers

audio:
  preserve_audio: true     # Copy audio track from input to output
  verify_sync: true        # Validate frame count and duration matching
  audio_codec: "copy"      # Preserve original audio codec

stereo:
  consistency_mode: "master_slave"  # "independent", "master_slave", "joint"
  validation_threshold: 0.8         # Similarity threshold between eyes
  correction_method: "transfer"     # "transfer", "reprocess", "ensemble"

performance:
  profile_enabled: true             # Detailed timing analysis
  preserve_intermediates: false     # For debugging workflow

debugging:
  log_intermediate_workflow: true       # Document chunk lifecycle
  save_detection_visualization: false   # Debug detection mismatches
  frame_count_validation: true          # Ensure exact frame preservation
```
Technical Implementation Details
Audio Preservation Implementation
```python
# During the final video save, include an audio stream copy from the input.
import subprocess

ffmpeg_cmd = [
    'ffmpeg', '-y',
    '-framerate', str(fps),
    '-i', frame_pattern,      # Video frames (image sequence pattern)
    '-i', input_video_path,   # Original video, used only for its audio
    '-c:v', 'h264_nvenc',     # GPU video codec (with CPU fallback)
    '-c:a', 'copy',           # Copy audio without re-encoding
    '-map', '0:v:0',          # Map video from the first input
    '-map', '1:a:0',          # Map audio from the second input
    '-shortest',              # Match the shortest stream duration
    output_path,
]
subprocess.run(ffmpeg_cmd, check=True)
```
Streaming Output Implementation
```python
# Instead of accumulating all frames in memory, write each processed chunk
# to a temporary segment and merge everything at the end.
class StreamingVideoWriter:
    def __init__(self, output_path, fps, audio_source):
        self.output_path = output_path
        self.fps = fps
        self.audio_source = audio_source
        self.temp_segments = []
        self.current_segment = 0

    def write_chunk(self, processed_frames):
        # Write this chunk to its own temporary segment file
        segment_path = f"temp_segment_{self.current_segment}.mp4"
        self.write_video_segment(processed_frames, segment_path)
        self.temp_segments.append(segment_path)
        self.current_segment += 1

    def finalize(self):
        # Merge all segments and copy the audio track from the source
        self.merge_segments_with_audio()
```
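One way the merge step could work, using ffmpeg's concat demuxer; the manifest name and cleanup policy here are illustrative:

```python
import os
import subprocess

def merge_segments_with_audio(segments, audio_source, output_path):
    """Concatenate video segments losslessly and copy the original audio."""
    with open("segments.txt", "w") as f:  # concat demuxer manifest
        for seg in segments:
            f.write(f"file '{os.path.abspath(seg)}'\n")
    subprocess.run([
        'ffmpeg', '-y',
        '-f', 'concat', '-safe', '0', '-i', 'segments.txt',  # video segments
        '-i', audio_source,                                   # original audio
        '-c:v', 'copy',                                       # no re-encode
        '-c:a', 'copy',
        '-map', '0:v:0', '-map', '1:a:0',
        '-shortest',
        output_path,
    ], check=True)
    for seg in segments:  # remove temporary segment files
        os.remove(seg)
```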
Memory Usage Calculation
```python
def estimate_memory_requirements(duration_seconds, fps, resolution_scale=0.5):
    """Estimate RAM needed to hold every processed frame in memory."""
    frames = duration_seconds * fps
    # Per-frame memory for the full SBS frame as float32 RGB
    # (6144x3072 at 50% scale = 3072x1536 -> ~54MB per frame)
    width = int(6144 * resolution_scale)
    height = int(3072 * resolution_scale)
    frame_size_mb = (width * height * 3 * 4) / (1024 * 1024)
    total_memory_gb = (frames * frame_size_mb) / 1024
    return {
        'duration': duration_seconds,
        'total_frames': frames,
        'estimated_memory_gb': total_memory_gb,
        'safe_for_48gb': total_memory_gb < 40,
    }

# Example outputs at 30 fps, 50% scale:
# 30 seconds: ~47GB (already borderline for a 48GB container)
# 5 minutes: ~475GB (requires streaming)
# 1 hour: ~5.6TB (requires streaming)
```
Success Criteria
Technical Feasibility
- Process 30s VR180 clip without manual intervention
- Maintain <10GB VRAM usage on RTX 3080
- Achieve acceptable matting quality at 50% scale
- Complete processing in <15 minutes locally
Quality Benchmarks
- Clean edges with minimal artifacts
- Temporal consistency across frames
- Stereo alignment between left/right eyes
- Usable results for green screen compositing
Scalability Validation
- Configuration-driven parameter control
- Clear performance vs quality tradeoffs identified
- Docker deployment pathway established
- Cost/benefit analysis for cloud GPU usage
Risk Mitigation
VRAM Limitations
- Fallback: Automatic chunking with overlap processing
- Monitoring: Real-time VRAM usage tracking
- Graceful Degradation: Quality reduction before failure
Quality Issues
- Validation Pipeline: Automated quality assessment
- Manual Override: Optional bounding box adjustment
- Fallback Methods: Integration points for RVM if needed
Performance Bottlenecks
- Profiling: Detailed timing analysis per component
- Optimization: Identify CPU vs GPU bound operations
- Scaling Strategy: Clear upgrade path to cloud GPUs