VR180 Human Matting Proof of Concept - Det-SAM2 Approach

Project Overview

A proof-of-concept implementation to test the feasibility of using Det-SAM2 for automated human matting on VR180 3D side-by-side equirectangular video. The system will process a 30-second test clip to evaluate quality, performance, and resource requirements on local RTX 3080 hardware, with design considerations for cloud GPU scaling.

Input Specifications

  • Format: VR180 3D side-by-side equirectangular video
  • Resolution: 6144x3072 (3072x3072 per eye)
  • Test Duration: 30 seconds
  • Layout: Left eye (0-3071px), Right eye (3072-6143px)
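
Splitting the side-by-side frame into its two eyes is a simple horizontal slice; a minimal sketch in NumPy (the frame is assumed to be a height x width x channels array):

import numpy as np

def split_sbs_frame(frame: np.ndarray):
    """Split a 6144x3072 SBS frame into left/right 3072x3072 eye views."""
    height, width = frame.shape[:2]
    half = width // 2                 # 3072 for a 6144-wide frame
    left_eye = frame[:, :half]        # columns 0-3071
    right_eye = frame[:, half:]       # columns 3072-6143
    return left_eye, right_eye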

Core Functionality

Automatic Person Detection

  • Method: YOLOv8 integration with Det-SAM2
  • Detection: Automatic bounding box placement on all humans
  • Minimal Manual Input: Fully automated pipeline with no point selection required
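
A minimal sketch of this detection step, assuming the ultralytics YOLOv8 package (class 0 is "person" in the COCO label set it ships with); the resulting boxes would then be passed to Det-SAM2 as box prompts:

from ultralytics import YOLO

model = YOLO("yolov8n.pt")  # model name taken from the detection config

def detect_people(frame, confidence_threshold=0.7):
    """Return person bounding boxes as [x1, y1, x2, y2] lists."""
    results = model(frame, verbose=False)[0]
    boxes = []
    for box in results.boxes:
        if int(box.cls) == 0 and float(box.conf) >= confidence_threshold:
            boxes.append(box.xyxy[0].tolist())
    return boxes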

Processing Strategy

  • Primary Approach: Process both eyes using disparity mapping optimization
  • Fallback: Independent processing per eye if disparity mapping proves complex
  • Chunking: Adaptive segmentation (full 30s clip preferred, fallback to smaller chunks if VRAM limited)

Scaling and Quality Options

  • Resolution Scaling: 25%, 50%, or 100% processing resolution
  • Mask Upscaling: AI-based upscaling to full resolution for final output
  • Quality vs Performance: Configurable tradeoffs for local vs cloud processing
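
As a rough illustration of the scale-down/scale-up round trip, the sketch below uses plain OpenCV interpolation; the final resize is only a placeholder for the point where an AI-based upscaler would slot in:

import cv2

def downscale_frame(frame, scale_factor):
    """Downscale a frame for processing (e.g. 0.5 -> half resolution)."""
    return cv2.resize(frame, None, fx=scale_factor, fy=scale_factor,
                      interpolation=cv2.INTER_AREA)

def upscale_mask(mask, full_width, full_height):
    """Upscale a low-resolution matte back to output resolution.
    Plain bilinear here; an AI upscaler would replace this step."""
    return cv2.resize(mask, (full_width, full_height),
                      interpolation=cv2.INTER_LINEAR)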

Configuration System

YAML/TOML Configuration File

input:
  video_path: "path/to/input.mp4"
  
processing:
  scale_factor: 0.5  # 0.25, 0.5, 1.0
  chunk_size: 900    # frames, 0 for full video
  overlap_frames: 60 # for chunked processing
  
detection:
  confidence_threshold: 0.7
  model: "yolov8n"   # yolov8n, yolov8s, yolov8m
  
matting:
  use_disparity_mapping: true
  memory_offload: true
  fp16: true
  
output:
  path: "path/to/output/"
  format: "alpha"     # "alpha" or "greenscreen"
  background_color: [0, 255, 0]  # for greenscreen
  maintain_sbs: true  # keep side-by-side format
  
hardware:
  device: "cuda"
  max_vram_gb: 10     # RTX 3080 limit
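
A minimal sketch of loading this configuration with PyYAML (the file name config.yaml is an assumption):

import yaml

with open("config.yaml") as f:
    config = yaml.safe_load(f)

scale_factor = config["processing"]["scale_factor"]   # 0.25, 0.5, or 1.0
chunk_size = config["processing"]["chunk_size"]        # frames, 0 for full video
device = config["hardware"]["device"]                  # "cuda"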

Technical Implementation

Memory Optimization (Det-SAM2 Enhancements)

  • CPU Offloading: offload_video_to_cpu=True
  • FP16 Storage: Reduce memory usage by ~50%
  • Frame Release: release_old_frames() for constant VRAM usage
  • Adaptive Chunking: Automatic chunk size based on available VRAM
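
A sketch of how these options might be wired into the SAM2 video predictor. offload_video_to_cpu and offload_state_to_cpu are real init_state() parameters in the SAM2 API; release_old_frames() is assumed to be the Det-SAM2 extension named above, and the config, checkpoint, and chunk paths are placeholders:

import torch
from sam2.build_sam import build_sam2_video_predictor

# Config / checkpoint names and the chunk path below are placeholders
predictor = build_sam2_video_predictor("sam2_hiera_s.yaml", "sam2_hiera_small.pt")

with torch.autocast("cuda", dtype=torch.float16):       # FP16 storage/compute
    state = predictor.init_state(
        video_path="chunk_000.mp4",
        offload_video_to_cpu=True,    # keep decoded frames in system RAM
        offload_state_to_cpu=True,    # keep the inference state off the GPU
    )
    # ... add YOLO box prompts, then predictor.propagate_in_video(state) ...
    # Det-SAM2's release_old_frames() (assumed API) would be called while
    # propagating to drop frames that are no longer needed from VRAM.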

VR180-Specific Optimizations

  • Stereo Processing: Leverage disparity mapping for efficiency
  • Cross-Eye Validation: Ensure consistency between left/right views
  • Edge Refinement: Multi-resolution processing for clean matting boundaries

Output Options

  • Alpha Channel: Transparent PNG sequence or video with alpha
  • Green Screen: Configurable background color for traditional keying
  • Format Preservation: Maintain original SBS layout or output separate eyes
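
A minimal sketch of the two output modes, compositing a frame against a single-channel matte (values 0-255) with NumPy; the default color matches the background_color setting in the configuration:

import numpy as np

def apply_matte(frame, matte, mode="greenscreen", background_color=(0, 255, 0)):
    """Composite a frame against its matte.

    mode="alpha": return an RGBA frame with the matte as the alpha channel.
    mode="greenscreen": replace the background with a solid keying color.
    """
    if mode == "alpha":
        return np.dstack([frame, matte])
    alpha = matte.astype(np.float32)[..., None] / 255.0
    background = np.full_like(frame, background_color)
    composite = frame * alpha + background * (1.0 - alpha)
    return composite.astype(np.uint8)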

Performance Targets

Local RTX 3080 (10GB VRAM)

  • 25% Scale: ~5-8 FPS processing, ~6 minutes for 30s clip
  • 50% Scale: ~3-5 FPS processing, ~10 minutes for 30s clip
  • 100% Scale: Chunked processing required, ~15-20 minutes for 30s clip

Cloud GPU Scaling (Future)

  • Design Considerations: Docker containerization ready
  • Provider Agnostic: Compatible with RunPod, Vast.ai, etc.
  • Batch Processing: Queue-based job distribution
  • Cost Estimation: Target $0.10-0.50 per 30s clip processing

Quality Assessment Features

Automated Quality Metrics

  • Edge Consistency: Measure aliasing and stair-stepping
  • Temporal Stability: Frame-to-frame consistency scoring
  • Stereo Alignment: Left/right eye correspondence validation
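
Both the temporal and the stereo checks can be built on a simple mask-overlap score; a sketch of IoU between two binary masks:

import numpy as np

def mask_iou(mask_a: np.ndarray, mask_b: np.ndarray) -> float:
    """Intersection-over-union between two boolean masks.

    Used frame-to-frame for temporal stability, and left-vs-right
    (after shifting by the expected disparity) for stereo alignment.
    """
    a = mask_a.astype(bool)
    b = mask_b.astype(bool)
    union = np.logical_or(a, b).sum()
    if union == 0:
        return 1.0  # both masks empty: treat as perfectly consistent
    return float(np.logical_and(a, b).sum() / union)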

Debug/Analysis Outputs

  • Detection Visualization: Bounding boxes overlaid on frames
  • Confidence Maps: Per-pixel matting confidence scores
  • Processing Stats: VRAM usage, FPS, chunk information
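
A minimal sketch of the detection visualization using OpenCV drawing primitives (boxes as x1, y1, x2, y2):

import cv2

def draw_detections(frame, boxes, color=(0, 255, 0)):
    """Overlay detection bounding boxes on a copy of the frame."""
    vis = frame.copy()
    for x1, y1, x2, y2 in boxes:
        cv2.rectangle(vis, (int(x1), int(y1)), (int(x2), int(y2)), color, 2)
    return vis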

Deliverables

Phase 1: Core Implementation

  1. Det-SAM2 Integration: Automatic detection pipeline
  2. VRAM Optimization: Memory management for RTX 3080
  3. Basic Matting: Single-resolution processing
  4. Configuration System: YAML-based parameter control

Phase 2: VR180 Optimization

  1. Disparity Processing: Stereo-aware matting
  2. Multi-Resolution: Scaling and upsampling pipeline
  3. Quality Assessment: Automated metrics and visualization
  4. Edge Refinement: Anti-aliasing and boundary smoothing

Phase 3: Production Ready

  1. Cloud GPU Support: Docker containerization
  2. Batch Processing: Multiple video queue system
  3. Performance Profiling: Detailed resource usage analytics
  4. Quality Validation: Comprehensive testing suite

Post-Implementation Optimization Opportunities

Based on the first successful 30-second test clip run (A40 GPU, 50% scale, 9 chunks of 200 frames)

Performance Analysis Findings

  • Processing Speed: ~0.54s per frame (64.4s for 120 frames per chunk)
  • VRAM Utilization: Only 2.5% (1.11GB of 45GB available) - significantly underutilized
  • RAM Usage: 106GB used of 494GB available (21.5%)
  • Primary Bottleneck: Intermediate ffmpeg encoding operations per chunk

Identified Optimization Categories

Category A: Performance Improvements (Quick Wins)

  1. Audio Track Preservation ⚠️ CRITICAL

    • Issue: Output video missing audio track from input
    • Solution: Use ffmpeg to copy audio stream during final video creation
    • Implementation: Add -c:a copy to final ffmpeg command
    • Impact: Essential for production usability
    • Risk: Low, standard ffmpeg operation
  2. Frame Count Synchronization ⚠️ CRITICAL

    • Issue: Audio sync drift if input/output frame counts differ
    • Solution: Validate exact frame count preservation throughout pipeline
    • Implementation: Frame count verification + duration matching (see the ffprobe sketch after this list)
    • Impact: Prevents audio desync in long videos
    • Risk: Low, validation feature
  3. Memory Usage Reality Check ⚠️ IMPORTANT

    • Current assumption: Unlimited RAM for memory-only pipeline
    • Reality: RunPod container limited to ~48GB RAM
    • Risk calculation: a 1-hour video (~213k frames) could require 20-40+ GB of memory
    • Solution: Implement streaming output instead of full in-memory accumulation
    • Impact: Enables processing of long-form content
    • Risk: Medium, requires pipeline restructuring
  4. Larger Chunk Sizes

    • Current: 200 frames per chunk (conservative for 10GB RTX 3080)
    • Opportunity: 600-800 frames per chunk on high-VRAM systems
    • Impact: Reduce 9 chunks to 2-3 chunks, fewer intermediate operations
    • Risk: Low, easily configurable
  5. Streaming Output Pipeline

    • Current: Accumulate all processed frames in memory, write once
    • Opportunity: Write processed chunks to temporary segments, merge at end
    • Impact: Constant memory usage regardless of video length
    • Risk: Medium, requires temporary file management
  6. Enhanced Performance Profiling

    • Current: Basic memory monitoring
    • Opportunity: Detailed timing per processing stage (detection, propagation, encoding)
    • Impact: Identify exact bottlenecks for targeted optimization
    • Risk: Low, debugging feature
  7. Parallel Eye Processing

    • Current: Sequential left eye → right eye processing
    • Opportunity: Process both eyes simultaneously
    • Impact: Potential 50% speedup, better GPU utilization
    • Risk: Medium, memory management complexity
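
For the frame count synchronization item above, a minimal validation sketch using ffprobe (assumed to be on PATH):

import subprocess

def count_frames(video_path: str) -> int:
    """Count decoded video frames with ffprobe (exact, but slow for long videos)."""
    out = subprocess.run(
        ['ffprobe', '-v', 'error', '-select_streams', 'v:0',
         '-count_frames', '-show_entries', 'stream=nb_read_frames',
         '-of', 'csv=p=0', video_path],
        capture_output=True, text=True, check=True,
    )
    return int(out.stdout.strip())

def verify_frame_counts(input_path: str, output_path: str) -> bool:
    return count_frames(input_path) == count_frames(output_path)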

Category B: Stereo Consistency Fixes (Critical for VR)

  1. Master-Slave Eye Processing

    • Issue: Independent detection leads to mismatched person counts between eyes
    • Solution: Use left eye detections as "seeds" for right eye processing
    • Impact: Ensures identical person detection across stereo pair
    • Risk: Low, maintains current quality while improving consistency
  2. Cross-Eye Detection Validation

    • Issue: Hair/clothing included on one eye but not the other
    • Solution: Compare detection results, flag inconsistencies for reprocessing
    • Impact: Substantially improved stereo alignment (target: 90%+ left/right correspondence)
    • Risk: Low, fallback to current behavior
  3. Disparity-Aware Segmentation

    • Issue: Segmentation boundaries differ between eyes despite same person
    • Solution: Use stereo disparity to correlate features between eyes
    • Impact: True stereo-consistent matting
    • Risk: High, complex implementation
  4. Joint Stereo Detection

    • Issue: YOLO runs independently on each eye
    • Solution: Run YOLO on full SBS frame, split detections spatially
    • Impact: Guaranteed identical detection counts
    • Risk: Medium, requires detection coordinate mapping
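
For the joint stereo detection option (item 4), mapping full-SBS detections back to per-eye coordinates is a fixed horizontal offset; a sketch assuming boxes in SBS pixel coordinates:

def split_sbs_detections(boxes, eye_width=3072):
    """Assign full-SBS detections to the left or right eye and convert
    them to per-eye coordinates. Boxes straddling the seam are assigned
    by their horizontal center."""
    left, right = [], []
    for x1, y1, x2, y2 in boxes:
        center_x = (x1 + x2) / 2
        if center_x < eye_width:
            left.append([x1, y1, x2, y2])
        else:
            right.append([x1 - eye_width, y1, x2 - eye_width, y2])
    return left, right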

Category C: Advanced Optimizations (Future)

  1. Adaptive Memory Management

    • Opportunity: Dynamic chunk sizing based on real-time VRAM usage
    • Impact: Optimal resource utilization across different hardware
    • Risk: Medium, complex heuristics
  2. Multi-Resolution Processing

    • Opportunity: Initial processing at lower resolution, edge refinement at full
    • Impact: Speed improvement while maintaining quality
    • Risk: Medium, quality validation required
  3. Enhanced Workflow Documentation

    • Issue: Unclear intermediate data lifecycle
    • Solution: Detailed logging of chunk processing, optional intermediate preservation
    • Impact: Better debugging and user understanding
    • Risk: Low, documentation feature
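
The adaptive memory management idea (item 1 above) could start from the free-VRAM query PyTorch exposes; a rough sketch in which the per-frame VRAM cost is an assumed tuning constant that would be measured by profiling:

import torch

def adaptive_chunk_size(vram_per_frame_gb=0.05, min_frames=100, max_frames=800):
    """Pick a chunk size from the currently free VRAM."""
    free_bytes, _total_bytes = torch.cuda.mem_get_info()
    free_gb = free_bytes / 1024**3
    budget_gb = free_gb * 0.8           # leave 20% headroom
    frames = int(budget_gb / vram_per_frame_gb)
    return max(min_frames, min(frames, max_frames))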

Implementation Strategy

  • Phase A: Quick performance wins (larger chunks, profiling)
  • Phase B: Stereo consistency (master-slave, validation)
  • Phase C: Advanced features (disparity-aware, memory optimization)

Configuration Extensions Required

processing:
  chunk_size: 600  # Increase from 200 for high-VRAM systems
  memory_pipeline: false  # Skip intermediate video creation (disabled due to RAM limits)
  streaming_output: true  # Write chunks progressively instead of accumulating
  parallel_eyes: false  # Process eyes simultaneously
  max_memory_gb: 40  # Realistic RAM limit for RunPod containers
  
audio:
  preserve_audio: true  # Copy audio track from input to output
  verify_sync: true  # Validate frame count and duration matching
  audio_codec: "copy"  # Preserve original audio codec
  
stereo:
  consistency_mode: "master_slave"  # "independent", "master_slave", "joint"
  validation_threshold: 0.8  # Similarity threshold between eyes
  correction_method: "transfer"  # "transfer", "reprocess", "ensemble"
  
performance:
  profile_enabled: true  # Detailed timing analysis
  preserve_intermediates: false  # For debugging workflow

debugging:
  log_intermediate_workflow: true  # Document chunk lifecycle
  save_detection_visualization: false  # Debug detection mismatches
  frame_count_validation: true  # Ensure exact frame preservation

Technical Implementation Details

Audio Preservation Implementation

# During final video save, include the audio stream from the original input
import subprocess

def save_with_audio(frame_pattern, input_video_path, output_path, fps):
    ffmpeg_cmd = [
        'ffmpeg', '-y',
        '-framerate', str(fps),
        '-i', frame_pattern,           # Video frames (image sequence)
        '-i', input_video_path,        # Original video, used here for its audio
        '-c:v', 'h264_nvenc',          # GPU video codec (with CPU fallback)
        '-c:a', 'copy',                # Copy audio without re-encoding
        '-map', '0:v:0',               # Map video from first input
        '-map', '1:a:0',               # Map audio from second input
        '-shortest',                   # Match shortest stream duration
        output_path
    ]
    subprocess.run(ffmpeg_cmd, check=True)

Streaming Output Implementation

# Instead of accumulating frames in memory, write each chunk to its own
# temporary segment and merge the segments (plus audio) at the end:
class StreamingVideoWriter:
    def __init__(self, output_path, fps, audio_source):
        self.output_path = output_path
        self.fps = fps
        self.audio_source = audio_source   # original video, kept for its audio track
        self.temp_segments = []
        self.current_segment = 0

    def write_chunk(self, processed_frames):
        # Encode one chunk to a temporary segment instead of holding it in RAM
        segment_path = f"temp_segment_{self.current_segment}.mp4"
        self.write_video_segment(processed_frames, segment_path)
        self.temp_segments.append(segment_path)
        self.current_segment += 1

    def finalize(self):
        # Merge all segments into the final output, copying audio from the source
        self.merge_segments_with_audio()
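
One way the merge_segments_with_audio() step could be implemented is with ffmpeg's concat demuxer, copying the video streams and taking the audio from the original input; a sketch:

import subprocess

def merge_segments_with_audio(segment_paths, audio_source, output_path):
    # The concat demuxer reads a small text file listing the segments in order
    with open("segments.txt", "w") as f:
        for path in segment_paths:
            f.write(f"file '{path}'\n")

    subprocess.run([
        'ffmpeg', '-y',
        '-f', 'concat', '-safe', '0', '-i', 'segments.txt',  # merged video
        '-i', audio_source,                                    # original audio
        '-c:v', 'copy', '-c:a', 'copy',
        '-map', '0:v:0', '-map', '1:a:0',
        '-shortest',
        output_path
    ], check=True)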

Memory Usage Calculation

def estimate_memory_requirements(duration_seconds, fps, resolution_scale=0.5):
    """Estimate RAM needed to hold every processed frame of a VR180 SBS video."""
    frames = duration_seconds * fps

    # Per-frame memory: the full SBS frame (6144x3072) at the processing scale,
    # stored as RGBA uint8 (~18MB per frame at 50% scale)
    width = int(6144 * resolution_scale)
    height = int(3072 * resolution_scale)
    frame_size_mb = (width * height * 4) / (1024 * 1024)

    total_memory_gb = (frames * frame_size_mb) / 1024

    return {
        'duration': duration_seconds,
        'total_frames': frames,
        'estimated_memory_gb': total_memory_gb,
        'safe_for_48gb': total_memory_gb < 40
    }

# Example outputs at 60 fps, 50% scale (~18MB per frame):
# 30 seconds: ~32GB (safe, but close to the 40GB limit)
# 5 minutes: ~316GB (requires streaming)
# 1 hour: ~3800GB (requires streaming)

Success Criteria

Technical Feasibility

  • Process 30s VR180 clip without manual intervention
  • Maintain <10GB VRAM usage on RTX 3080
  • Achieve acceptable matting quality at 50% scale
  • Complete processing in <15 minutes locally

Quality Benchmarks

  • Clean edges with minimal artifacts
  • Temporal consistency across frames
  • Stereo alignment between left/right eyes
  • Usable results for green screen compositing

Scalability Validation

  • Configuration-driven parameter control
  • Clear performance vs quality tradeoffs identified
  • Docker deployment pathway established
  • Cost/benefit analysis for cloud GPU usage

Risk Mitigation

VRAM Limitations

  • Fallback: Automatic chunking with overlap processing
  • Monitoring: Real-time VRAM usage tracking
  • Graceful Degradation: Quality reduction before failure

Quality Issues

  • Validation Pipeline: Automated quality assessment
  • Manual Override: Optional bounding box adjustment
  • Fallback Methods: Integration points for RVM if needed

Performance Bottlenecks

  • Profiling: Detailed timing analysis per component
  • Optimization: Identify CPU vs GPU bound operations
  • Scaling Strategy: Clear upgrade path to cloud GPUs