# VR180 Human Matting Proof of Concept - Det-SAM2 Approach
## Project Overview
A proof-of-concept implementation to test the feasibility of using Det-SAM2 for automated human matting on VR180 3D side-by-side equirectangular video. The system will process a 30-second test clip to evaluate quality, performance, and resource requirements on local RTX 3080 hardware, with design considerations for cloud GPU scaling.
## Input Specifications

- **Format**: VR180 3D side-by-side equirectangular video
- **Resolution**: 6144x3072 (3072x3072 per eye)
- **Test Duration**: 30 seconds
- **Layout**: Left eye (0-3071px), Right eye (3072-6143px)
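
Splitting a decoded SBS frame into per-eye views is a plain array slice. A minimal sketch (the NumPy frame layout and the 6144x3072 geometry above are the only assumptions):

```python
import numpy as np

def split_sbs_frame(frame: np.ndarray) -> tuple[np.ndarray, np.ndarray]:
    """Split a 6144x3072 side-by-side frame into two 3072x3072 eye views."""
    assert frame.shape[1] % 2 == 0, "SBS frame width must be even"
    half = frame.shape[1] // 2   # 3072 for the target footage
    left_eye = frame[:, :half]   # columns 0-3071
    right_eye = frame[:, half:]  # columns 3072-6143
    return left_eye, right_eye
```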

## Core Functionality

### Automatic Person Detection
- **Method**: YOLOv8 integration with Det-SAM2
- **Detection**: Automatic bounding box placement on all humans
- **Minimal Manual Input**: Fully automated pipeline with no point selection required

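A minimal sketch of this stage using the ultralytics YOLO API; `detect_person_boxes` is an illustrative helper name, and the model/threshold values mirror the configuration file below:

```python
from ultralytics import YOLO

detector = YOLO("yolov8n.pt")  # model choice comes from the config

def detect_person_boxes(frame, confidence_threshold=0.7):
    """Return [x1, y1, x2, y2] person boxes to seed SAM2 box prompts."""
    result = detector(frame, verbose=False)[0]
    boxes = []
    for box in result.boxes:
        if int(box.cls) == 0 and float(box.conf) >= confidence_threshold:  # COCO class 0 = person
            boxes.append(box.xyxy[0].tolist())
    return boxes
```
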
### Processing Strategy
- **Primary Approach**: Process both eyes using disparity mapping optimization
- **Fallback**: Independent processing per eye if disparity mapping proves complex
- **Chunking**: Adaptive segmentation (full 30s clip preferred, fallback to smaller chunks if VRAM limited)

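A sketch of the adaptive chunking fallback, deriving chunk size from currently free VRAM; `bytes_per_pixel` is an assumed calibration constant, not a measured figure:

```python
import torch

def adaptive_chunk_size(frame_height, frame_width, max_chunk=900,
                        bytes_per_pixel=6, safety_margin=0.8):
    """Pick a chunk size that fits in currently free VRAM.

    bytes_per_pixel (assumed) covers fp16 frame storage plus per-frame
    SAM2 memory-bank overhead; calibrate it against real runs.
    """
    free_bytes, _ = torch.cuda.mem_get_info()
    per_frame_bytes = frame_height * frame_width * bytes_per_pixel
    fits = int((free_bytes * safety_margin) / per_frame_bytes)
    return max(1, min(fits, max_chunk))
```
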
### Scaling and Quality Options
- **Resolution Scaling**: 25%, 50%, or 100% processing resolution
- **Mask Upscaling**: AI-based upscaling to full resolution for final output
- **Quality vs Performance**: Configurable tradeoffs for local vs cloud processing

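The AI upscaler has not been selected yet; as a placeholder, a classical resampling sketch shows where mask upscaling sits in the pipeline (Lanczos plus a light blur to soften stair-stepping):

```python
import cv2

def upscale_mask(mask, target_size=(3072, 3072)):
    """Upscale a low-resolution alpha mask to full per-eye resolution."""
    upscaled = cv2.resize(mask, target_size, interpolation=cv2.INTER_LANCZOS4)
    return cv2.GaussianBlur(upscaled, (3, 3), 0)  # soften aliased edges
```
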
## Configuration System

### YAML/TOML Configuration File
```yaml
input:
  video_path: "path/to/input.mp4"

processing:
  scale_factor: 0.5    # 0.25, 0.5, 1.0
  chunk_size: 900      # frames, 0 for full video
  overlap_frames: 60   # for chunked processing

detection:
  confidence_threshold: 0.7
  model: "yolov8n"     # yolov8n, yolov8s, yolov8m

matting:
  use_disparity_mapping: true
  memory_offload: true
  fp16: true

output:
  path: "path/to/output/"
  format: "alpha"                # "alpha" or "greenscreen"
  background_color: [0, 255, 0]  # for greenscreen
  maintain_sbs: true             # keep side-by-side format

hardware:
  device: "cuda"
  max_vram_gb: 10      # RTX 3080 limit
```
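
A minimal sketch of loading this file with PyYAML (the defaults shown are illustrative):

```python
import yaml

def load_config(path="config.yaml"):
    """Load the pipeline configuration, filling in a few safe defaults."""
    with open(path) as f:
        cfg = yaml.safe_load(f)
    cfg.setdefault("processing", {}).setdefault("scale_factor", 0.5)
    cfg.setdefault("hardware", {}).setdefault("device", "cuda")
    return cfg
```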
## Technical Implementation

### Memory Optimization (Det-SAM2 Enhancements)
- **CPU Offloading**: `offload_video_to_cpu=True`
- **FP16 Storage**: Reduce memory usage by ~50%
- **Frame Release**: `release_old_frames()` for constant VRAM usage
- **Adaptive Chunking**: Automatic chunk size based on available VRAM

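A sketch of wiring these flags into the SAM2 video predictor; the config/checkpoint names are placeholders for the local install, and `release_old_frames()` is a Det-SAM2 addition rather than upstream SAM2 API:

```python
import torch
from sam2.build_sam import build_sam2_video_predictor

# Placeholder paths; adjust to the local SAM2 install
predictor = build_sam2_video_predictor("sam2_hiera_s.yaml", "sam2_hiera_small.pt")

with torch.autocast("cuda", dtype=torch.float16):  # FP16 inference
    state = predictor.init_state(
        video_path="chunk_0000.mp4",
        offload_video_to_cpu=True,   # keep decoded frames in CPU RAM
        offload_state_to_cpu=True,   # keep inference state off the GPU
    )
# Det-SAM2's release_old_frames() is then called during propagation
# to hold VRAM usage constant.
```
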
### VR180-Specific Optimizations
- **Stereo Processing**: Leverage disparity mapping for efficiency
- **Cross-Eye Validation**: Ensure consistency between left/right views
- **Edge Refinement**: Multi-resolution processing for clean matting boundaries

### Output Options
- **Alpha Channel**: Transparent PNG sequence or video with alpha
- **Green Screen**: Configurable background color for traditional keying
- **Format Preservation**: Maintain original SBS layout or output separate eyes

## Performance Targets

### Local RTX 3080 (10GB VRAM)
- **25% Scale**: ~5-8 FPS processing, ~6 minutes for 30s clip
- **50% Scale**: ~3-5 FPS processing, ~10 minutes for 30s clip
- **100% Scale**: Chunked processing required, ~15-20 minutes for 30s clip

### Cloud GPU Scaling (Future)
- **Design Considerations**: Docker containerization ready
- **Provider Agnostic**: Compatible with RunPod, Vast.ai, etc.
- **Batch Processing**: Queue-based job distribution
- **Cost Estimation**: Target $0.10-0.50 per 30s clip processing

## Quality Assessment Features

### Automated Quality Metrics
- **Edge Consistency**: Measure aliasing and stair-stepping
- **Temporal Stability**: Frame-to-frame consistency scoring
- **Stereo Alignment**: Left/right eye correspondence validation

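As one possible implementation of the temporal stability score, mean IoU between consecutive masks is a simple, assumption-light metric:

```python
import numpy as np

def temporal_stability(masks, threshold=0.5):
    """Score frame-to-frame consistency as mean IoU of consecutive masks."""
    ious = []
    for prev, curr in zip(masks[:-1], masks[1:]):
        a, b = prev > threshold, curr > threshold
        union = np.logical_or(a, b).sum()
        if union:
            ious.append(np.logical_and(a, b).sum() / union)
    return float(np.mean(ious)) if ious else 1.0
```
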
### Debug/Analysis Outputs
- **Detection Visualization**: Bounding boxes overlaid on frames
- **Confidence Maps**: Per-pixel matting confidence scores
- **Processing Stats**: VRAM usage, FPS, chunk information

## Deliverables

### Phase 1: Core Implementation
1. **Det-SAM2 Integration**: Automatic detection pipeline
2. **VRAM Optimization**: Memory management for RTX 3080
3. **Basic Matting**: Single-resolution processing
4. **Configuration System**: YAML-based parameter control

### Phase 2: VR180 Optimization
1. **Disparity Processing**: Stereo-aware matting
2. **Multi-Resolution**: Scaling and upsampling pipeline
3. **Quality Assessment**: Automated metrics and visualization
4. **Edge Refinement**: Anti-aliasing and boundary smoothing

### Phase 3: Production Ready
1. **Cloud GPU Support**: Docker containerization
2. **Batch Processing**: Multiple video queue system
3. **Performance Profiling**: Detailed resource usage analytics
4. **Quality Validation**: Comprehensive testing suite

## Post-Implementation Optimization Opportunities

*Based on results from the first successful 30-second test-clip run (A40 GPU, 50% scale, 9x200-frame chunks)*

### Performance Analysis Findings
- **Processing Speed**: ~0.54s per frame (64.4s for a 120-frame chunk)
- **VRAM Utilization**: Only 2.5% (1.11GB of 45GB available), significantly underutilized
- **RAM Usage**: 106GB used of 494GB available (21.5%)
- **Primary Bottleneck**: Intermediate ffmpeg encoding operations per chunk

### Identified Optimization Categories

#### Category A: Performance Improvements (Quick Wins)
1. **Audio Track Preservation** ⚠️ **CRITICAL**
   - Issue: Output video is missing the audio track from the input
   - Solution: Use ffmpeg to copy the audio stream during final video creation
   - Implementation: Add `-c:a copy` to the final ffmpeg command
   - Impact: Essential for production usability
   - Risk: Low, standard ffmpeg operation

2. **Frame Count Synchronization** ⚠️ **CRITICAL**
   - Issue: Audio sync drift if input/output frame counts differ
   - Solution: Validate exact frame count preservation throughout the pipeline (see the sketch after this list)
   - Implementation: Frame count verification + duration matching
   - Impact: Prevents audio desync in long videos
   - Risk: Low, validation feature

3. **Memory Usage Reality Check** ⚠️ **IMPORTANT**
   - Current assumption: Unlimited RAM for the memory-only pipeline
   - Reality: RunPod container limited to ~48GB RAM
   - Risk calculation: a 1-hour video = ~213k frames = potentially 20-40GB+ of memory
   - Solution: Implement streaming output instead of full in-memory accumulation
   - Impact: Enables processing of long-form content
   - Risk: Medium, requires pipeline restructuring

4. **Larger Chunk Sizes**
   - Current: 200 frames per chunk (conservative for the 10GB RTX 3080)
   - Opportunity: 600-800 frames per chunk on high-VRAM systems
   - Impact: Reduce 9 chunks to 2-3 chunks, fewer intermediate operations
   - Risk: Low, easily configurable

5. **Streaming Output Pipeline**
   - Current: Accumulate all processed frames in memory, write once
   - Opportunity: Write processed chunks to temporary segments, merge at the end
   - Impact: Constant memory usage regardless of video length
   - Risk: Medium, requires temporary file management

6. **Enhanced Performance Profiling**
   - Current: Basic memory monitoring
   - Opportunity: Detailed timing per processing stage (detection, propagation, encoding)
   - Impact: Identify exact bottlenecks for targeted optimization
   - Risk: Low, debugging feature

7. **Parallel Eye Processing**
   - Current: Sequential left eye → right eye processing
   - Opportunity: Process both eyes simultaneously
   - Impact: Up to ~50% reduction in wall-clock time, better GPU utilization
   - Risk: Medium, memory management complexity

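A sketch of the frame-count validation from item 2, using OpenCV metadata; note that `CAP_PROP_FRAME_COUNT` is a container estimate, so a full decode pass would be the strict check:

```python
import cv2

def validate_frame_counts(input_path, output_path):
    """Verify the output preserves the input's frame count and duration."""
    def probe(path):
        cap = cv2.VideoCapture(path)
        frames = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))  # container estimate
        fps = cap.get(cv2.CAP_PROP_FPS)
        cap.release()
        return frames, fps

    in_frames, in_fps = probe(input_path)
    out_frames, _ = probe(output_path)
    if in_frames != out_frames:
        raise ValueError(f"Frame count drift: {in_frames} in vs {out_frames} out")
    return in_frames / in_fps  # duration in seconds
```
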
#### Category B: Stereo Consistency Fixes (Critical for VR)
1. **Master-Slave Eye Processing**
   - Issue: Independent detection leads to mismatched person counts between eyes
   - Solution: Use left eye detections as "seeds" for right eye processing
   - Impact: Ensures identical person detection across the stereo pair
   - Risk: Low, maintains current quality while improving consistency

2. **Cross-Eye Detection Validation**
   - Issue: Hair/clothing included in one eye but not the other
   - Solution: Compare detection results, flag inconsistencies for reprocessing
   - Impact: Substantially improved stereo alignment
   - Risk: Low, falls back to current behavior

3. **Disparity-Aware Segmentation**
   - Issue: Segmentation boundaries differ between eyes despite the same person
   - Solution: Use stereo disparity to correlate features between eyes
   - Impact: True stereo-consistent matting
   - Risk: High, complex implementation

4. **Joint Stereo Detection**
   - Issue: YOLO runs independently on each eye
   - Solution: Run YOLO on the full SBS frame, split detections spatially (see the sketch after this list)
   - Impact: A single detection pass shared by both eyes, giving consistent person counts
   - Risk: Medium, requires detection coordinate mapping

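A sketch of the joint stereo detection idea from item 4: one detector pass over the full SBS frame, with each box assigned to an eye by horizontal position. It reuses the illustrative `detect_person_boxes` helper from the detection sketch earlier:

```python
def detect_joint_stereo(sbs_frame, confidence_threshold=0.7):
    """Detect once on the full SBS frame, then split boxes per eye."""
    eye_width = sbs_frame.shape[1] // 2  # 3072 for the target footage
    left, right = [], []
    for x1, y1, x2, y2 in detect_person_boxes(sbs_frame, confidence_threshold):
        if (x1 + x2) / 2 < eye_width:
            left.append([x1, y1, min(x2, eye_width), y2])
        else:
            # Shift right-eye boxes into per-eye coordinates
            right.append([max(x1 - eye_width, 0), y1, x2 - eye_width, y2])
    return left, right
```
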
#### Category C: Advanced Optimizations (Future)
1. **Adaptive Memory Management**
   - Opportunity: Dynamic chunk sizing based on real-time VRAM usage
   - Impact: Optimal resource utilization across different hardware
   - Risk: Medium, complex heuristics

2. **Multi-Resolution Processing**
   - Opportunity: Initial processing at lower resolution, edge refinement at full
   - Impact: Speed improvement while maintaining quality
   - Risk: Medium, quality validation required

3. **Enhanced Workflow Documentation**
   - Issue: Unclear intermediate data lifecycle
   - Solution: Detailed logging of chunk processing, optional intermediate preservation
   - Impact: Better debugging and user understanding
   - Risk: Low, documentation feature

### Implementation Strategy
- **Phase A**: Quick performance wins (larger chunks, profiling)
- **Phase B**: Stereo consistency (master-slave, validation)
- **Phase C**: Advanced features (disparity-aware, memory optimization)

### Configuration Extensions Required
```yaml
processing:
  chunk_size: 600          # Increase from 200 for high-VRAM systems
  memory_pipeline: false   # Skip intermediate video creation (disabled due to RAM limits)
  streaming_output: true   # Write chunks progressively instead of accumulating
  parallel_eyes: false     # Process eyes simultaneously
  max_memory_gb: 40        # Realistic RAM limit for RunPod containers

audio:
  preserve_audio: true     # Copy audio track from input to output
  verify_sync: true        # Validate frame count and duration matching
  audio_codec: "copy"      # Preserve original audio codec

stereo:
  consistency_mode: "master_slave"  # "independent", "master_slave", "joint"
  validation_threshold: 0.8         # Similarity threshold between eyes
  correction_method: "transfer"     # "transfer", "reprocess", "ensemble"

performance:
  profile_enabled: true            # Detailed timing analysis
  preserve_intermediates: false    # For debugging workflow

debugging:
  log_intermediate_workflow: true       # Document chunk lifecycle
  save_detection_visualization: false   # Debug detection mismatches
  frame_count_validation: true          # Ensure exact frame preservation
```

### Technical Implementation Details

#### Audio Preservation Implementation
```python
# During final video save, include audio stream copy
ffmpeg_cmd = [
    'ffmpeg', '-y',
    '-framerate', str(fps),
    '-i', frame_pattern,      # Video frames
    '-i', input_video_path,   # Original video for audio
    '-c:v', 'h264_nvenc',     # GPU video codec (fall back to libx264 if unavailable)
    '-c:a', 'copy',           # Copy audio without re-encoding
    '-map', '0:v:0',          # Map video from first input
    '-map', '1:a:0',          # Map audio from second input
    '-shortest',              # Match shortest stream duration
    output_path
]
```
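
The `h264_nvenc` comment assumes a CPU fallback; one way that fallback could look (a sketch, not implemented behavior):

```python
import subprocess

def run_with_codec_fallback(ffmpeg_cmd):
    """Try NVENC first; rerun with libx264 if the GPU encoder fails."""
    try:
        subprocess.run(ffmpeg_cmd, check=True, capture_output=True)
    except subprocess.CalledProcessError:
        cpu_cmd = ["libx264" if arg == "h264_nvenc" else arg for arg in ffmpeg_cmd]
        subprocess.run(cpu_cmd, check=True)
```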

#### Streaming Output Implementation
```python
# Instead of accumulating frames in memory, write each chunk to a temporary
# segment and merge at the end. Segment writing via OpenCV is a sketch;
# any encoder works as long as segments share codec parameters.
import subprocess
import cv2

class StreamingVideoWriter:
    def __init__(self, output_path, fps, audio_source):
        self.output_path = output_path
        self.fps = fps
        self.audio_source = audio_source  # original input, supplies the audio track
        self.temp_segments = []
        self.current_segment = 0

    def write_chunk(self, processed_frames):
        # Write chunk to temporary segment
        segment_path = f"temp_segment_{self.current_segment}.mp4"
        h, w = processed_frames[0].shape[:2]
        writer = cv2.VideoWriter(segment_path, cv2.VideoWriter_fourcc(*"mp4v"),
                                 self.fps, (w, h))
        for frame in processed_frames:
            writer.write(frame)  # expects BGR uint8 frames
        writer.release()
        self.temp_segments.append(segment_path)
        self.current_segment += 1

    def finalize(self):
        # Merge all segments (ffmpeg concat demuxer) and copy audio from the source
        with open("segments.txt", "w") as f:
            f.writelines(f"file '{p}'\n" for p in self.temp_segments)
        subprocess.run(["ffmpeg", "-y", "-f", "concat", "-safe", "0",
                        "-i", "segments.txt", "-i", self.audio_source,
                        "-map", "0:v:0", "-map", "1:a:0?",
                        "-c", "copy", "-shortest", self.output_path],
                       check=True)
```

#### Memory Usage Calculation
```python
def estimate_memory_requirements(duration_seconds, fps, resolution_scale=0.5):
    """Estimate RAM needed to hold every processed frame in memory at once."""
    frames = duration_seconds * fps

    # Per-frame memory for a full SBS frame at the given scale,
    # stored as float32 RGB: width x height x 3 channels x 4 bytes
    width = int(6144 * resolution_scale)
    height = int(3072 * resolution_scale)
    frame_size_mb = (width * height * 3 * 4) / (1024 * 1024)  # ~54MB at 50% scale

    total_memory_gb = (frames * frame_size_mb) / 1024

    return {
        'duration': duration_seconds,
        'total_frames': frames,
        'estimated_memory_gb': total_memory_gb,
        'safe_for_48gb': total_memory_gb < 40
    }

# Example outputs at 30 fps, 50% scale (float32 RGB frames):
# 30 seconds: ~47GB (already beyond the 40GB budget)
# 5 minutes: ~475GB (requires streaming)
# 1 hour: ~5.6TB (requires streaming)
# Full in-memory accumulation is unsafe even for short clips,
# which motivates the streaming output pipeline above.
```
## Success Criteria

### Technical Feasibility
- [ ] Process 30s VR180 clip without manual intervention
- [ ] Maintain <10GB VRAM usage on RTX 3080
- [ ] Achieve acceptable matting quality at 50% scale
- [ ] Complete processing in <15 minutes locally

### Quality Benchmarks
- [ ] Clean edges with minimal artifacts
- [ ] Temporal consistency across frames
- [ ] Stereo alignment between left/right eyes
- [ ] Usable results for green screen compositing

### Scalability Validation
- [ ] Configuration-driven parameter control
- [ ] Clear performance vs quality tradeoffs identified
- [ ] Docker deployment pathway established
- [ ] Cost/benefit analysis for cloud GPU usage

## Risk Mitigation

### VRAM Limitations
- **Fallback**: Automatic chunking with overlap processing
- **Monitoring**: Real-time VRAM usage tracking
- **Graceful Degradation**: Quality reduction before failure

### Quality Issues
- **Validation Pipeline**: Automated quality assessment
- **Manual Override**: Optional bounding box adjustment
- **Fallback Methods**: Integration points for RVM if needed

### Performance Bottlenecks
- **Profiling**: Detailed timing analysis per component
- **Optimization**: Identify CPU vs GPU bound operations
- **Scaling Strategy**: Clear upgrade path to cloud GPUs