
VR180 Human Matting Proof of Concept - Det-SAM2 Approach

Project Overview

A proof-of-concept implementation to test the feasibility of using Det-SAM2 for automated human matting on VR180 3D side-by-side equirectangular video. The system will process a 30-second test clip to evaluate quality, performance, and resource requirements on local RTX 3080 hardware, with design considerations for cloud GPU scaling.

Input Specifications

  • Format: VR180 3D side-by-side equirectangular video
  • Resolution: 6144x3072 (3072x3072 per eye)
  • Test Duration: 30 seconds
  • Layout: Left eye (0-3071px), Right eye (3072-6143px)

Core Functionality

Automatic Person Detection

  • Method: YOLOv8 integration with Det-SAM2
  • Detection: Automatic bounding box placement on all humans
  • Minimal Manual Input: Fully automated by default, with no point selection required; manual bounding-box adjustment remains an optional override (see the detection sketch below)
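
A minimal sketch of the detection-to-prompt handoff, assuming the ultralytics YOLO API and the SAM2 video predictor; the model config, checkpoint, and video paths are placeholders:

import cv2
from ultralytics import YOLO
from sam2.build_sam import build_sam2_video_predictor

PERSON_CLASS = 0  # COCO class id for "person"

detector = YOLO("yolov8n.pt")
predictor = build_sam2_video_predictor("sam2_hiera_s.yaml", "sam2_hiera_small.pt")
state = predictor.init_state(video_path="clip_left_eye.mp4")

# Detect every person on the first frame and register each box with SAM2.
ok, first_frame = cv2.VideoCapture("clip_left_eye.mp4").read()
result = detector(first_frame, classes=[PERSON_CLASS], conf=0.7)[0]
for obj_id, box in enumerate(result.boxes.xyxy.cpu().numpy()):
    predictor.add_new_points_or_box(state, frame_idx=0, obj_id=obj_id, box=box)

# Propagate the box-seeded masks through the rest of the clip.
for frame_idx, obj_ids, mask_logits in predictor.propagate_in_video(state):
    masks = (mask_logits > 0.0).cpu().numpy()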

Processing Strategy

  • Primary Approach: Process both eyes jointly, using disparity mapping to share masks between views
  • Fallback: Independent processing per eye if disparity mapping proves too complex or unreliable
  • Chunking: Adaptive segmentation (process the full 30 s clip when possible; fall back to smaller overlapping chunks if VRAM is limited, see the sketch below)
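
A minimal chunking sketch matching the chunk_size and overlap_frames settings in the configuration below (make_chunks is a hypothetical helper):

def make_chunks(total_frames, chunk_size, overlap):
    if chunk_size <= 0:            # 0 means "process the full clip at once"
        return [(0, total_frames)]
    chunks, start = [], 0
    while start < total_frames:
        end = min(start + chunk_size, total_frames)
        chunks.append((start, end))
        if end == total_frames:
            break
        start = end - overlap      # re-run `overlap` frames for mask continuity
    return chunks

# make_chunks(900, 900, 60)  -> [(0, 900)]   (single chunk, the preferred path)
# make_chunks(1800, 900, 60) -> [(0, 900), (840, 1740), (1680, 1800)]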

Scaling and Quality Options

  • Resolution Scaling: 25%, 50%, or 100% processing resolution
  • Mask Upscaling: AI-based upscaling to full resolution for final output
  • Quality vs Performance: Configurable tradeoffs for local vs cloud processing
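
For illustration, a plain bilinear resize can stand in for the AI upscaler (upscale_mask is a hypothetical helper):

import cv2
import numpy as np

def upscale_mask(mask_small, full_size):
    """full_size is (width, height) of one full-resolution eye, e.g. (3072, 3072)."""
    mask = cv2.resize(mask_small.astype(np.float32), full_size,
                      interpolation=cv2.INTER_LINEAR)
    return (mask > 0.5).astype(np.uint8)

# Example: a mask produced at 50% scale, restored to a 3072x3072 eye.
mask_small = np.zeros((1536, 1536), dtype=np.uint8)
mask_full = upscale_mask(mask_small, (3072, 3072))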

Configuration System

YAML/TOML Configuration File

Example configuration (YAML):

input:
  video_path: "path/to/input.mp4"
  
processing:
  scale_factor: 0.5  # 0.25, 0.5, 1.0
  chunk_size: 900    # frames, 0 for full video
  overlap_frames: 60 # for chunked processing
  
detection:
  confidence_threshold: 0.7
  model: "yolov8n"   # yolov8n, yolov8s, yolov8m
  
matting:
  use_disparity_mapping: true
  memory_offload: true
  fp16: true
  
output:
  path: "path/to/output/"
  format: "alpha"     # "alpha" or "greenscreen"
  background_color: [0, 255, 0]  # for greenscreen
  maintain_sbs: true  # keep side-by-side format
  
hardware:
  device: "cuda"
  max_vram_gb: 10     # RTX 3080 limit
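
Loading the file is straightforward with PyYAML (config.yaml is an assumed file name):

import yaml

with open("config.yaml") as f:
    cfg = yaml.safe_load(f)

scale = cfg["processing"]["scale_factor"]     # 0.25 / 0.5 / 1.0
chunk_size = cfg["processing"]["chunk_size"]  # 0 = full video
device = cfg["hardware"]["device"]            # "cuda"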

Technical Implementation

Memory Optimization (Det-SAM2 Enhancements)

  • CPU Offloading: offload_video_to_cpu=True
  • FP16 Storage: Reduce memory usage by ~50%
  • Frame Release: release_old_frames() for constant VRAM usage
  • Adaptive Chunking: Automatic chunk size based on available VRAM
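
A sketch of a memory-constrained session: the offload flags are SAM2 init_state arguments, while release_old_frames() is the Det-SAM2 extension and appears here only as a comment; paths are placeholders:

import torch
from sam2.build_sam import build_sam2_video_predictor

predictor = build_sam2_video_predictor("sam2_hiera_s.yaml", "sam2_hiera_small.pt")

with torch.autocast("cuda", dtype=torch.float16):  # FP16 storage/compute
    state = predictor.init_state(
        video_path="clip_left_eye.mp4",
        offload_video_to_cpu=True,    # decoded frames live in system RAM
        offload_state_to_cpu=True,    # inference state kept off the GPU
    )
    for frame_idx, obj_ids, mask_logits in predictor.propagate_in_video(state):
        pass  # consume masks here; Det-SAM2 would call release_old_frames()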

VR180-Specific Optimizations

  • Stereo Processing: Leverage disparity mapping for efficiency
  • Cross-Eye Validation: Ensure consistency between left/right views
  • Edge Refinement: Multi-resolution processing for clean matting boundaries
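
A rough sketch of the cross-eye check, using OpenCV SGBM disparity to warp the left-eye mask into the right view; treating the equirectangular projection as locally rectified stereo is a simplifying assumption:

import cv2
import numpy as np

def warp_left_mask_to_right(left_gray, right_gray, left_mask):
    sgbm = cv2.StereoSGBM_create(minDisparity=0, numDisparities=128, blockSize=9)
    disp = sgbm.compute(left_gray, right_gray).astype(np.float32) / 16.0
    h, w = left_mask.shape
    xs, ys = np.meshgrid(np.arange(w, dtype=np.float32),
                         np.arange(h, dtype=np.float32))
    # Right-view pixel x corresponds (approximately) to left-view pixel x + d.
    warped = cv2.remap(left_mask.astype(np.float32), xs + disp, ys,
                       cv2.INTER_NEAREST)
    return warped > 0.5

def mask_iou(a, b):
    union = np.logical_or(a, b).sum()
    return np.logical_and(a, b).sum() / union if union else 1.0

# A low IoU between the warped left mask and the right-eye mask flags frames
# where the two views disagree.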

Output Options

  • Alpha Channel: Transparent PNG sequence or video with alpha
  • Green Screen: Configurable background color for traditional keying
  • Format Preservation: Maintain original SBS layout or output separate eyes
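
A minimal compositing sketch covering both output modes (composite is a hypothetical helper; note the config color is RGB while OpenCV frames are BGR):

import numpy as np

def composite(frame_bgr, mask, mode="alpha", bg_color=(0, 255, 0)):
    if mode == "alpha":
        alpha = (mask * 255).astype(np.uint8)
        return np.dstack([frame_bgr, alpha])          # BGRA, save as PNG
    bg = np.full_like(frame_bgr, bg_color[::-1])      # RGB config -> BGR frame
    m = mask[..., None].astype(np.float32)
    return (frame_bgr * m + bg * (1 - m)).astype(np.uint8)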

Performance Targets

Local RTX 3080 (10GB VRAM)

  • 25% Scale: ~5-8 FPS processing, ~6 minutes for 30s clip
  • 50% Scale: ~3-5 FPS processing, ~10 minutes for 30s clip
  • 100% Scale: Chunked processing required, ~15-20 minutes for 30s clip

Cloud GPU Scaling (Future)

  • Design Considerations: Docker containerization ready
  • Provider Agnostic: Compatible with RunPod, Vast.ai, etc.
  • Batch Processing: Queue-based job distribution
  • Cost Estimation: Target $0.10-0.50 per 30s clip processing

Quality Assessment Features

Automated Quality Metrics

  • Edge Consistency: Measure aliasing and stair-stepping
  • Temporal Stability: Frame-to-frame consistency scoring
  • Stereo Alignment: Left/right eye correspondence validation
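
As an example, temporal stability can be scored as the mean IoU between consecutive masks (temporal_stability is a hypothetical helper):

import numpy as np

def temporal_stability(masks):
    ious = []
    for prev, cur in zip(masks, masks[1:]):
        union = np.logical_or(prev, cur).sum()
        ious.append(np.logical_and(prev, cur).sum() / union if union else 1.0)
    return float(np.mean(ious)) if ious else 1.0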

Debug/Analysis Outputs

  • Detection Visualization: Bounding boxes overlaid on frames
  • Confidence Maps: Per-pixel matting confidence scores
  • Processing Stats: VRAM usage, FPS, chunk information
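
A debug-overlay sketch (draw_boxes is a hypothetical helper; boxes are xyxy pixel coordinates, as produced by YOLOv8):

import cv2

def draw_boxes(frame, boxes, color=(0, 0, 255)):
    for x1, y1, x2, y2 in boxes:
        cv2.rectangle(frame, (int(x1), int(y1)), (int(x2), int(y2)), color, 2)
    return frame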

Deliverables

Phase 1: Core Implementation

  1. Det-SAM2 Integration: Automatic detection pipeline
  2. VRAM Optimization: Memory management for RTX 3080
  3. Basic Matting: Single-resolution processing
  4. Configuration System: YAML-based parameter control

Phase 2: VR180 Optimization

  1. Disparity Processing: Stereo-aware matting
  2. Multi-Resolution: Scaling and upsampling pipeline
  3. Quality Assessment: Automated metrics and visualization
  4. Edge Refinement: Anti-aliasing and boundary smoothing

Phase 3: Production Ready

  1. Cloud GPU Support: Docker containerization
  2. Batch Processing: Multiple video queue system
  3. Performance Profiling: Detailed resource usage analytics
  4. Quality Validation: Comprehensive testing suite

Success Criteria

Technical Feasibility

  • Process 30s VR180 clip without manual intervention
  • Maintain <10GB VRAM usage on RTX 3080
  • Achieve acceptable matting quality at 50% scale
  • Complete processing in <15 minutes locally

Quality Benchmarks

  • Clean edges with minimal artifacts
  • Temporal consistency across frames
  • Stereo alignment between left/right eyes
  • Usable results for green screen compositing

Scalability Validation

  • Configuration-driven parameter control
  • Clear performance vs quality tradeoffs identified
  • Docker deployment pathway established
  • Cost/benefit analysis for cloud GPU usage

Risk Mitigation

VRAM Limitations

  • Fallback: Automatic chunking with overlap processing
  • Monitoring: Real-time VRAM usage tracking
  • Graceful Degradation: Quality reduction before failure
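
A monitoring sketch using PyTorch's allocator counters (vram_report is a hypothetical helper):

import torch

def vram_report(tag=""):
    used = torch.cuda.memory_allocated() / 2**30
    peak = torch.cuda.max_memory_allocated() / 2**30
    print(f"[{tag}] VRAM: {used:.2f} GiB in use, {peak:.2f} GiB peak")

# Called per chunk, this feeds the graceful-degradation path: if peak nears
# max_vram_gb, drop scale_factor or shrink chunk_size before failing.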

Quality Issues

  • Validation Pipeline: Automated quality assessment
  • Manual Override: Optional bounding box adjustment
  • Fallback Methods: Integration points for Robust Video Matting (RVM) if needed

Performance Bottlenecks

  • Profiling: Detailed timing analysis per component
  • Optimization: Identify CPU vs GPU bound operations
  • Scaling Strategy: Clear upgrade path to cloud GPUs