# VR180 Human Matting Proof of Concept - Det-SAM2 Approach

## Project Overview

A proof-of-concept implementation to test the feasibility of using Det-SAM2 for automated human matting on VR180 3D side-by-side equirectangular video. The system will process a 30-second test clip to evaluate quality, performance, and resource requirements on local RTX 3080 hardware, with design considerations for cloud GPU scaling.

## Input Specifications

- **Format**: VR180 3D side-by-side equirectangular video
- **Resolution**: 6144x3072 (3072x3072 per eye)
- **Test Duration**: 30 seconds
- **Layout**: Left eye (0-3071px), Right eye (3072-6143px)
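
The layout above maps directly onto two crop rectangles. A minimal sketch, assuming an even-width side-by-side frame; `eye_crop_boxes` is a hypothetical helper name used only for illustration:

```python
def eye_crop_boxes(frame_width, frame_height):
    """Return (x0, y0, x1, y1) crop boxes for the left and right eye.

    x1 and y1 are exclusive, so the values can be used directly as slice bounds.
    """
    if frame_width % 2 != 0:
        raise ValueError("side-by-side frame width must be even")
    eye_width = frame_width // 2
    left = (0, 0, eye_width, frame_height)
    right = (eye_width, 0, frame_width, frame_height)
    return left, right


left_box, right_box = eye_crop_boxes(6144, 3072)
print(left_box)   # (0, 0, 3072, 3072)
print(right_box)  # (3072, 0, 6144, 3072)
```

With exclusive upper bounds, the left box covers columns 0-3071 and the right box covers columns 3072-6143, matching the layout listed above.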

## Core Functionality

### Automatic Person Detection
- **Method**: YOLOv8 integration with Det-SAM2
- **Detection**: Automatic bounding box placement on all humans
- **Minimal Manual Input**: Fully automated pipeline with no point selection required

### Processing Strategy
- **Primary Approach**: Process both eyes using disparity mapping optimization
- **Fallback**: Independent processing per eye if disparity mapping proves complex
- **Chunking**: Adaptive segmentation (full 30s clip preferred, fallback to smaller chunks if VRAM limited)
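
One way to turn the chunking rule into frame ranges is sketched below. It assumes `chunk_size` and `overlap_frames` behave as in the configuration file (with `0` meaning the whole clip); `plan_chunks` is a hypothetical helper, and how the overlapping frames are consumed (blending, re-seeding) is left open here:

```python
def plan_chunks(total_frames, chunk_size, overlap_frames):
    """Return a list of (start, end) frame ranges, with end exclusive.

    Consecutive chunks share `overlap_frames` frames.
    chunk_size == 0 means the whole clip is processed as one chunk.
    """
    if chunk_size == 0 or chunk_size >= total_frames:
        return [(0, total_frames)]
    if not 0 <= overlap_frames < chunk_size:
        raise ValueError("overlap_frames must be in [0, chunk_size)")
    step = chunk_size - overlap_frames
    chunks = []
    start = 0
    while True:
        end = min(start + chunk_size, total_frames)
        chunks.append((start, end))
        if end == total_frames:
            break
        start += step
    return chunks


# Assumed clip length of 1800 frames, for illustration only
print(plan_chunks(1800, 900, 60))  # [(0, 900), (840, 1740), (1680, 1800)]
print(plan_chunks(1800, 0, 60))    # [(0, 1800)]
```

Note that the final chunk can be much shorter than `chunk_size`; a real implementation might merge it into the previous chunk if VRAM allows.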

### Scaling and Quality Options
- **Resolution Scaling**: 25%, 50%, or 100% processing resolution
- **Mask Upscaling**: AI-based upscaling to full resolution for final output
- **Quality vs Performance**: Configurable tradeoffs for local vs cloud processing
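
For reference, the three scale factors give the following processing resolutions for the full side-by-side frame. This is a small sketch; rounding down to even dimensions is an assumption (many video codecs expect even sizes), and `processing_size` is a hypothetical name:

```python
def processing_size(width, height, scale_factor):
    """Scale a frame size, rounding each dimension down to an even number."""
    if scale_factor not in (0.25, 0.5, 1.0):
        raise ValueError("scale_factor must be 0.25, 0.5 or 1.0")
    scaled_width = int(width * scale_factor) // 2 * 2
    scaled_height = int(height * scale_factor) // 2 * 2
    return scaled_width, scaled_height


for scale in (0.25, 0.5, 1.0):
    print(scale, processing_size(6144, 3072, scale))
# 0.25 (1536, 768)
# 0.5 (3072, 1536)
# 1.0 (6144, 3072)
```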

## Configuration System

### YAML/TOML Configuration File
```yaml
input:
  video_path: "path/to/input.mp4"

processing:
  scale_factor: 0.5  # 0.25, 0.5, 1.0
  chunk_size: 900    # frames, 0 for full video
  overlap_frames: 60 # for chunked processing

detection:
  confidence_threshold: 0.7
  model: "yolov8n"   # yolov8n, yolov8s, yolov8m

matting:
  use_disparity_mapping: true
  memory_offload: true
  fp16: true

output:
  path: "path/to/output/"
  format: "alpha"     # "alpha" or "greenscreen"
  background_color: [0, 255, 0]  # for greenscreen
  maintain_sbs: true  # keep side-by-side format

hardware:
  device: "cuda"
  max_vram_gb: 10     # RTX 3080 limit
```

## Technical Implementation

### Memory Optimization (Det-SAM2 Enhancements)
- **CPU Offloading**: `offload_video_to_cpu=True`
- **FP16 Storage**: Store tensors in half precision, reducing memory usage by ~50% relative to FP32
- **Frame Release**: `release_old_frames()` for constant VRAM usage
- **Adaptive Chunking**: Automatic chunk size based on available VRAM
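
Adaptive chunking could be driven by a simple budget calculation like the one below. This is only a sketch: `adaptive_chunk_size`, `reserved_gb`, and `max_chunk` are hypothetical, and `per_frame_mb` would have to be measured on the target hardware rather than assumed. In a PyTorch pipeline the free VRAM figure could come from `torch.cuda.mem_get_info()`.

```python
def adaptive_chunk_size(free_vram_gb, per_frame_mb, reserved_gb=2.0, max_chunk=900):
    """Pick a chunk size (in frames) that fits the available VRAM budget.

    reserved_gb is headroom kept free for model weights and activations
    (an assumed value); per_frame_mb is the measured VRAM cost per frame.
    """
    if per_frame_mb <= 0:
        raise ValueError("per_frame_mb must be positive")
    budget_mb = max(free_vram_gb - reserved_gb, 0) * 1024
    frames = int(budget_mb // per_frame_mb)
    # Always process at least one frame, never more than max_chunk
    return max(1, min(frames, max_chunk))


print(adaptive_chunk_size(10, 20))   # 409
print(adaptive_chunk_size(45, 20))   # 900 (capped by max_chunk)
```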

### VR180-Specific Optimizations
- **Stereo Processing**: Leverage disparity mapping for efficiency
- **Cross-Eye Validation**: Ensure consistency between left/right views
- **Edge Refinement**: Multi-resolution processing for clean matting boundaries

### Output Options
- **Alpha Channel**: Transparent PNG sequence or video with alpha
- **Green Screen**: Configurable background color for traditional keying
- **Format Preservation**: Maintain original SBS layout or output separate eyes

## Performance Targets

### Local RTX 3080 (10GB VRAM)
- **25% Scale**: ~5-8 FPS processing, ~6 minutes for 30s clip
- **50% Scale**: ~3-5 FPS processing, ~10 minutes for 30s clip
- **100% Scale**: Chunked processing required, ~15-20 minutes for 30s clip
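
The time targets follow from frame count divided by throughput. The sketch below assumes the test clip contains 1800 frames (9 chunks of 200 frames, as reported for the test run later in this document); with that assumption, the quoted durations correspond to the slow end of each FPS range:

```python
def estimated_minutes(total_frames, processing_fps):
    """Wall-clock minutes needed to process total_frames at a given throughput."""
    if processing_fps <= 0:
        raise ValueError("processing_fps must be positive")
    return total_frames / processing_fps / 60


CLIP_FRAMES = 1800  # assumption: 9 chunks x 200 frames

for label, low_fps, high_fps in (("25%", 5, 8), ("50%", 3, 5)):
    slowest = estimated_minutes(CLIP_FRAMES, low_fps)
    fastest = estimated_minutes(CLIP_FRAMES, high_fps)
    print(f"{label} scale: {fastest:.2f} to {slowest:.2f} min")
# 25% scale: 3.75 to 6.00 min
# 50% scale: 6.00 to 10.00 min
```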

### Cloud GPU Scaling (Future)
- **Design Considerations**: Docker containerization ready
- **Provider Agnostic**: Compatible with RunPod, Vast.ai, etc.
- **Batch Processing**: Queue-based job distribution
- **Cost Estimation**: Target $0.10-0.50 per 30s clip processing

## Quality Assessment Features

### Automated Quality Metrics
- **Edge Consistency**: Measure aliasing and stair-stepping
- **Temporal Stability**: Frame-to-frame consistency scoring
- **Stereo Alignment**: Left/right eye correspondence validation
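
As one possible reading of the temporal stability metric, the sketch below scores a mask sequence by the mean intersection-over-union (IoU) of consecutive frames. The choice of IoU is an assumption, not something this spec prescribes, and masks are represented as flat sequences of 0/1 to keep the example dependency-free; a real pipeline would operate on arrays.

```python
def mask_iou(mask_a, mask_b):
    """IoU of two binary masks given as equal-length flat sequences of 0/1."""
    if len(mask_a) != len(mask_b):
        raise ValueError("masks must have the same number of pixels")
    intersection = sum(1 for a, b in zip(mask_a, mask_b) if a and b)
    union = sum(1 for a, b in zip(mask_a, mask_b) if a or b)
    # Two empty masks are treated as identical
    return 1.0 if union == 0 else intersection / union


def temporal_stability(masks):
    """Mean IoU between consecutive masks; 1.0 means the mask never changes."""
    if len(masks) < 2:
        return 1.0
    scores = [mask_iou(prev, curr) for prev, curr in zip(masks, masks[1:])]
    return sum(scores) / len(scores)


frames = [
    [1, 1, 0, 0],
    [1, 1, 1, 0],
    [1, 1, 1, 0],
]
print(round(temporal_stability(frames), 4))  # 0.8333
```

Genuine subject motion also lowers this score, so it is better suited to flagging sudden drops (flicker) than to serving as an absolute quality number.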

### Debug/Analysis Outputs
- **Detection Visualization**: Bounding boxes overlaid on frames
- **Confidence Maps**: Per-pixel matting confidence scores
- **Processing Stats**: VRAM usage, FPS, chunk information

## Deliverables

### Phase 1: Core Implementation
1. **Det-SAM2 Integration**: Automatic detection pipeline
2. **VRAM Optimization**: Memory management for RTX 3080
3. **Basic Matting**: Single-resolution processing
4. **Configuration System**: YAML-based parameter control

### Phase 2: VR180 Optimization
1. **Disparity Processing**: Stereo-aware matting
2. **Multi-Resolution**: Scaling and upsampling pipeline
3. **Quality Assessment**: Automated metrics and visualization
4. **Edge Refinement**: Anti-aliasing and boundary smoothing

### Phase 3: Production Ready
1. **Cloud GPU Support**: Docker containerization
2. **Batch Processing**: Multiple video queue system
3. **Performance Profiling**: Detailed resource usage analytics
4. **Quality Validation**: Comprehensive testing suite

## Post-Implementation Optimization Opportunities

*Based on first successful 30-second test clip execution results (A40 GPU, 50% scale, 9x200 frame chunks)*

### Performance Analysis Findings
- **Processing Speed**: ~0.54s per frame (64.4s for 120 frames per chunk)
- **VRAM Utilization**: Only 2.5% (1.11GB of 45GB available) - significantly underutilized
- **RAM Usage**: 106GB used of 494GB available (21.5%)
- **Primary Bottleneck**: Intermediate ffmpeg encoding operations per chunk

### Identified Optimization Categories

#### Category A: Performance Improvements (Quick Wins)
1. **Audio Track Preservation** ⚠️ **CRITICAL**
   - Issue: Output video missing audio track from input
   - Solution: Use ffmpeg to copy audio stream during final video creation
   - Implementation: Add `-c:a copy` to final ffmpeg command
   - Impact: Essential for production usability
   - Risk: Low, standard ffmpeg operation

2. **Frame Count Synchronization** ⚠️ **CRITICAL**
   - Issue: Audio sync drift if input/output frame counts differ
   - Solution: Validate exact frame count preservation throughout pipeline
   - Implementation: Frame count verification + duration matching
   - Impact: Prevents audio desync in long videos
   - Risk: Low, validation feature

3. **Memory Usage Reality Check** ⚠️ **IMPORTANT**
   - Current assumption: Unlimited RAM for memory-only pipeline
   - Reality: RunPod container limited to ~48GB RAM
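   - Risk calculation: 1-hour video = ~213k frames (about 59 fps); if processed frames are held uncompressed in memory, usage grows linearly with frame count and would far exceed the container limit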
   - Solution: Implement streaming output instead of full in-memory accumulation
   - Impact: Enables processing of long-form content
   - Risk: Medium, requires pipeline restructuring

4. **Larger Chunk Sizes**
   - Current: 200 frames per chunk (conservative for 10GB RTX 3080)
   - Opportunity: 600-800 frames per chunk on high-VRAM systems
   - Impact: Reduce 9 chunks to 2-3 chunks, fewer intermediate operations
   - Risk: Low, easily configurable

5. **Streaming Output Pipeline**
   - Current: Accumulate all processed frames in memory, write once
   - Opportunity: Write processed chunks to temporary segments, merge at end
   - Impact: Constant memory usage regardless of video length
   - Risk: Medium, requires temporary file management

6. **Enhanced Performance Profiling**
   - Current: Basic memory monitoring
   - Opportunity: Detailed timing per processing stage (detection, propagation, encoding)
   - Impact: Identify exact bottlenecks for targeted optimization
   - Risk: Low, debugging feature

7. **Parallel Eye Processing**
   - Current: Sequential left eye → right eye processing
   - Opportunity: Process both eyes simultaneously
   - Impact: Up to ~50% less wall-clock time (about 2x throughput) if the GPU has spare capacity, better GPU utilization
   - Risk: Medium, memory management complexity
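
The frame count verification in item 2 could be sketched as follows. It uses `ffprobe` with `-count_frames`, which decodes the stream to count frames exactly (slow on long videos, but reliable); the function names are hypothetical:

```python
import subprocess


def ffprobe_frame_count_cmd(video_path):
    """Build an ffprobe command that prints the decoded frame count of the first video stream."""
    return [
        "ffprobe", "-v", "error",
        "-select_streams", "v:0",
        "-count_frames",
        "-show_entries", "stream=nb_read_frames",
        "-of", "default=nokey=1:noprint_wrappers=1",
        video_path,
    ]


def count_frames(video_path):
    """Run ffprobe and return the frame count as an integer."""
    result = subprocess.run(
        ffprobe_frame_count_cmd(video_path),
        capture_output=True, text=True, check=True,
    )
    return int(result.stdout.strip())


def verify_frame_counts(input_frames, output_frames):
    """Raise if the output does not contain exactly as many frames as the input."""
    if input_frames != output_frames:
        raise ValueError(
            f"frame count mismatch: input={input_frames}, output={output_frames}"
        )
```

A typical call would be `verify_frame_counts(count_frames(input_path), count_frames(output_path))` after the final encode, before any temporary files are deleted.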

#### Category B: Stereo Consistency Fixes (Critical for VR)
1. **Master-Slave Eye Processing**
   - Issue: Independent detection leads to mismatched person counts between eyes
   - Solution: Use left eye detections as "seeds" for right eye processing
   - Impact: Ensures identical person detection across stereo pair
   - Risk: Low, maintains current quality while improving consistency

2. **Cross-Eye Detection Validation**
   - Issue: Hair/clothing included on one eye but not the other
   - Solution: Compare detection results, flag inconsistencies for reprocessing
   - Impact: Expected 90%+ stereo alignment improvement (estimate)
   - Risk: Low, fallback to current behavior

3. **Disparity-Aware Segmentation**
   - Issue: Segmentation boundaries differ between eyes despite same person
   - Solution: Use stereo disparity to correlate features between eyes
   - Impact: True stereo-consistent matting
   - Risk: High, complex implementation

4. **Joint Stereo Detection**
   - Issue: YOLO runs independently on each eye
   - Solution: Run YOLO on full SBS frame, split detections spatially
   - Impact: Both eyes come from a single inference pass; per-eye detection counts should match but are not guaranteed, so validation is still needed
   - Risk: Medium, requires detection coordinate mapping
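
The coordinate mapping for joint stereo detection (item 4) might look like the sketch below. It assumes boxes arrive as `(x0, y0, x1, y1)` in full-frame pixels and assigns each box to an eye by its horizontal center; `split_sbs_detections` is a hypothetical helper, not part of YOLO or Det-SAM2:

```python
def split_sbs_detections(boxes, frame_width):
    """Split full-frame boxes into per-eye lists with eye-local x coordinates.

    A box is assigned to the eye containing its horizontal center and is
    clipped at the seam between the two eyes.
    """
    eye_width = frame_width // 2
    left, right = [], []
    for x0, y0, x1, y1 in boxes:
        center_x = (x0 + x1) / 2
        if center_x < eye_width:
            left.append((x0, y0, min(x1, eye_width), y1))
        else:
            right.append((max(x0, eye_width) - eye_width, y0, x1 - eye_width, y1))
    return left, right


detections = [(1000, 800, 1400, 2200), (4020, 810, 4430, 2210)]
left_boxes, right_boxes = split_sbs_detections(detections, 6144)
print(left_boxes)   # [(1000, 800, 1400, 2200)]
print(right_boxes)  # [(948, 810, 1358, 2210)]
```

Comparing `len(left_boxes)` with `len(right_boxes)` afterwards gives a cheap consistency check that can trigger the cross-eye validation described in item 2.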

#### Category C: Advanced Optimizations (Future)
1. **Adaptive Memory Management**
   - Opportunity: Dynamic chunk sizing based on real-time VRAM usage
   - Impact: Optimal resource utilization across different hardware
   - Risk: Medium, complex heuristics

2. **Multi-Resolution Processing**
   - Opportunity: Initial processing at lower resolution, edge refinement at full
   - Impact: Speed improvement while maintaining quality
   - Risk: Medium, quality validation required

3. **Enhanced Workflow Documentation**
   - Issue: Unclear intermediate data lifecycle
   - Solution: Detailed logging of chunk processing, optional intermediate preservation
   - Impact: Better debugging and user understanding
   - Risk: Low, documentation feature

### Implementation Strategy
- **Phase A**: Quick performance wins (larger chunks, profiling)
- **Phase B**: Stereo consistency (master-slave, validation)
- **Phase C**: Advanced features (disparity-aware, memory optimization)

### Configuration Extensions Required
```yaml
processing:
  chunk_size: 600  # Increase from 200 for high-VRAM systems
  memory_pipeline: false  # Skip intermediate video creation (disabled due to RAM limits)
  streaming_output: true  # Write chunks progressively instead of accumulating
  parallel_eyes: false  # Process eyes simultaneously
  max_memory_gb: 40  # Realistic RAM limit for RunPod containers

audio:
  preserve_audio: true  # Copy audio track from input to output
  verify_sync: true  # Validate frame count and duration matching
  audio_codec: "copy"  # Preserve original audio codec

stereo:
  consistency_mode: "master_slave"  # "independent", "master_slave", "joint"
  validation_threshold: 0.8  # Similarity threshold between eyes
  correction_method: "transfer"  # "transfer", "reprocess", "ensemble"

performance:
  profile_enabled: true  # Detailed timing analysis
  preserve_intermediates: false  # For debugging workflow

debugging:
  log_intermediate_workflow: true  # Document chunk lifecycle
  save_detection_visualization: false  # Debug detection mismatches
  frame_count_validation: true  # Ensure exact frame preservation
```

### Technical Implementation Details

#### Audio Preservation Implementation
```python
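import subprocess


def build_final_encode_cmd(frame_pattern, input_video_path, output_path, fps):
    """During final video save, include an audio stream copy from the original video."""
    return [
        'ffmpeg', '-y',
        '-framerate', str(fps),
        '-i', frame_pattern,           # Video frames
        '-i', input_video_path,        # Original video for audio
        '-map', '0:v:0',               # Map video from first input
        '-map', '1:a:0?',              # Map audio from second input ('?' tolerates inputs without audio)
        # GPU video codec (with CPU fallback). Check encoder limits: H.264 NVENC is
        # typically capped at 4096 px per side, so 6144-wide output may need hevc_nvenc.
        '-c:v', 'h264_nvenc',
        '-c:a', 'copy',                # Copy audio without re-encoding
        '-shortest',                   # Match shortest stream duration
        output_path,
    ]

# Usage: subprocess.run(build_final_encode_cmd(pattern, src, dst, fps), check=True)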
```

#### Streaming Output Implementation
```python
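# Instead of accumulating frames in memory:
class StreamingVideoWriter:
    def __init__(self, output_path, fps, audio_source):
        self.output_path = output_path
        self.fps = fps
        self.audio_source = audio_source  # Original video, used for the audio track
        self.temp_segments = []
        self.current_segment = 0

    def write_chunk(self, processed_frames):
        # Write chunk to a temporary segment; zero-padded names keep segments in order
        segment_path = f"temp_segment_{self.current_segment:04d}.mp4"
        self.write_video_segment(processed_frames, segment_path)
        self.temp_segments.append(segment_path)
        self.current_segment += 1

    def finalize(self):
        # Merge all segments with audio preservation
        self.merge_segments_with_audio()

    def write_video_segment(self, frames, segment_path):
        # Encoder-specific step, not defined in this spec
        raise NotImplementedError

    def merge_segments_with_audio(self):
        # Concatenate self.temp_segments and mux audio from self.audio_source
        raise NotImplementedError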
```
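
The merge step could be built on ffmpeg's concat demuxer, as in the sketch below. It assumes all segments were encoded with identical codec settings (required for stream copy) and that segment paths contain no single quotes; both helper names are hypothetical:

```python
import os


def write_concat_list(segment_paths, list_path):
    """Write the file list consumed by ffmpeg's concat demuxer."""
    with open(list_path, "w", encoding="utf-8") as handle:
        for path in segment_paths:
            handle.write(f"file '{os.path.abspath(path)}'\n")


def build_merge_cmd(list_path, audio_source, output_path):
    """Build an ffmpeg command that joins the segments and copies the source audio."""
    return [
        "ffmpeg", "-y",
        "-f", "concat", "-safe", "0", "-i", list_path,  # segment list
        "-i", audio_source,                              # original video for audio
        "-map", "0:v:0",
        "-map", "1:a:0?",                                # optional: skip if no audio
        "-c", "copy",                                    # no re-encoding
        "-shortest",
        output_path,
    ]
```

Because both video and audio are stream-copied, this final step is I/O-bound rather than compute-bound. The temporary segments and the list file can be deleted once the command succeeds.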

#### Memory Usage Calculation
```python
def estimate_memory_requirements(duration_seconds, fps, resolution_scale=0.5):
    """Estimate RAM needed to hold every processed frame in memory at once."""
    frames = int(duration_seconds * fps)

    # Full SBS frame is 6144x3072; scale both dimensions
    width = int(6144 * resolution_scale)
    height = int(3072 * resolution_scale)

    # Per-frame memory (rough estimate): 3 channels x 4 bytes per value
    # At 50% scale: 3072 * 1536 * 3 * 4 bytes = 54MB per frame
    frame_size_mb = (width * height * 3 * 4) / (1024 * 1024)

    total_memory_gb = (frames * frame_size_mb) / 1024

    return {
        'duration': duration_seconds,
        'total_frames': frames,
        'estimated_memory_gb': total_memory_gb,
        'safe_for_48gb': total_memory_gb < 40
    }

# Example outputs at 60 fps, 50% scale:
# 30 seconds (1,800 frames): ~95GB (requires streaming)
# 5 minutes (18,000 frames): ~949GB (requires streaming)
# 1 hour (216,000 frames): ~11,391GB (requires streaming)
```

## Success Criteria

### Technical Feasibility
- [ ] Process 30s VR180 clip without manual intervention
- [ ] Maintain <10GB VRAM usage on RTX 3080
- [ ] Achieve acceptable matting quality at 50% scale
- [ ] Complete processing in <15 minutes locally

### Quality Benchmarks
- [ ] Clean edges with minimal artifacts
- [ ] Temporal consistency across frames
- [ ] Stereo alignment between left/right eyes
- [ ] Usable results for green screen compositing

### Scalability Validation
- [ ] Configuration-driven parameter control
- [ ] Clear performance vs quality tradeoffs identified
- [ ] Docker deployment pathway established
- [ ] Cost/benefit analysis for cloud GPU usage

## Risk Mitigation

### VRAM Limitations
- **Fallback**: Automatic chunking with overlap processing
- **Monitoring**: Real-time VRAM usage tracking
- **Graceful Degradation**: Quality reduction before failure

### Quality Issues
- **Validation Pipeline**: Automated quality assessment
- **Manual Override**: Optional bounding box adjustment
- **Fallback Methods**: Integration points for RVM if needed

### Performance Bottlenecks
- **Profiling**: Detailed timing analysis per component
- **Optimization**: Identify CPU vs GPU bound operations
- **Scaling Strategy**: Clear upgrade path to cloud GPUs