first commit
This commit is contained in:
161
spec.md
Normal file
161
spec.md
Normal file
@@ -0,0 +1,161 @@
|
||||
# VR180 Human Matting Proof of Concept - Det-SAM2 Approach
|
||||
|
||||
## Project Overview
|
||||
|
||||
A proof-of-concept implementation to test the feasibility of using Det-SAM2 for automated human matting on VR180 3D side-by-side equirectangular video. The system will process a 30-second test clip to evaluate quality, performance, and resource requirements on local RTX 3080 hardware, with design considerations for cloud GPU scaling.
|
||||
|
||||
## Input Specifications
|
||||
|
||||
- **Format**: VR180 3D side-by-side equirectangular video
|
||||
- **Resolution**: 6144x3072 (3072x3072 per eye)
|
||||
- **Test Duration**: 30 seconds
|
||||
- **Layout**: Left eye (0-3071px), Right eye (3072-6143px)
|
||||
|
||||
## Core Functionality
|
||||
|
||||
### Automatic Person Detection
|
||||
- **Method**: YOLOv8 integration with Det-SAM2
|
||||
- **Detection**: Automatic bounding box placement on all humans
|
||||
- **Minimal Manual Input**: Fully automated pipeline with no point selection required
|
||||
|
||||
### Processing Strategy
|
||||
- **Primary Approach**: Process both eyes using disparity mapping optimization
|
||||
- **Fallback**: Independent processing per eye if disparity mapping proves complex
|
||||
- **Chunking**: Adaptive segmentation (full 30s clip preferred, fallback to smaller chunks if VRAM limited)
|
||||
|
||||
### Scaling and Quality Options
|
||||
- **Resolution Scaling**: 25%, 50%, or 100% processing resolution
|
||||
- **Mask Upscaling**: AI-based upscaling to full resolution for final output
|
||||
- **Quality vs Performance**: Configurable tradeoffs for local vs cloud processing
|
||||
|
||||
## Configuration System
|
||||
|
||||
### YAML/TOML Configuration File
|
||||
```yaml
|
||||
input:
|
||||
video_path: "path/to/input.mp4"
|
||||
|
||||
processing:
|
||||
scale_factor: 0.5 # 0.25, 0.5, 1.0
|
||||
chunk_size: 900 # frames, 0 for full video
|
||||
overlap_frames: 60 # for chunked processing
|
||||
|
||||
detection:
|
||||
confidence_threshold: 0.7
|
||||
model: "yolov8n" # yolov8n, yolov8s, yolov8m
|
||||
|
||||
matting:
|
||||
use_disparity_mapping: true
|
||||
memory_offload: true
|
||||
fp16: true
|
||||
|
||||
output:
|
||||
path: "path/to/output/"
|
||||
format: "alpha" # "alpha" or "greenscreen"
|
||||
background_color: [0, 255, 0] # for greenscreen
|
||||
maintain_sbs: true # keep side-by-side format
|
||||
|
||||
hardware:
|
||||
device: "cuda"
|
||||
max_vram_gb: 10 # RTX 3080 limit
|
||||
```
|
||||
|
||||
## Technical Implementation
|
||||
|
||||
### Memory Optimization (Det-SAM2 Enhancements)
|
||||
- **CPU Offloading**: `offload_video_to_cpu=True`
|
||||
- **FP16 Storage**: Reduce memory usage by ~50%
|
||||
- **Frame Release**: `release_old_frames()` for constant VRAM usage
|
||||
- **Adaptive Chunking**: Automatic chunk size based on available VRAM
|
||||
|
||||
### VR180-Specific Optimizations
|
||||
- **Stereo Processing**: Leverage disparity mapping for efficiency
|
||||
- **Cross-Eye Validation**: Ensure consistency between left/right views
|
||||
- **Edge Refinement**: Multi-resolution processing for clean matting boundaries
|
||||
|
||||
### Output Options
|
||||
- **Alpha Channel**: Transparent PNG sequence or video with alpha
|
||||
- **Green Screen**: Configurable background color for traditional keying
|
||||
- **Format Preservation**: Maintain original SBS layout or output separate eyes
|
||||
|
||||
## Performance Targets
|
||||
|
||||
### Local RTX 3080 (10GB VRAM)
|
||||
- **25% Scale**: ~5-8 FPS processing, ~6 minutes for 30s clip
|
||||
- **50% Scale**: ~3-5 FPS processing, ~10 minutes for 30s clip
|
||||
- **100% Scale**: Chunked processing required, ~15-20 minutes for 30s clip
|
||||
|
||||
### Cloud GPU Scaling (Future)
|
||||
- **Design Considerations**: Docker containerization ready
|
||||
- **Provider Agnostic**: Compatible with RunPod, Vast.ai, etc.
|
||||
- **Batch Processing**: Queue-based job distribution
|
||||
- **Cost Estimation**: Target $0.10-0.50 per 30s clip processing
|
||||
|
||||
## Quality Assessment Features
|
||||
|
||||
### Automated Quality Metrics
|
||||
- **Edge Consistency**: Measure aliasing and stair-stepping
|
||||
- **Temporal Stability**: Frame-to-frame consistency scoring
|
||||
- **Stereo Alignment**: Left/right eye correspondence validation
|
||||
|
||||
### Debug/Analysis Outputs
|
||||
- **Detection Visualization**: Bounding boxes overlaid on frames
|
||||
- **Confidence Maps**: Per-pixel matting confidence scores
|
||||
- **Processing Stats**: VRAM usage, FPS, chunk information
|
||||
|
||||
## Deliverables
|
||||
|
||||
### Phase 1: Core Implementation
|
||||
1. **Det-SAM2 Integration**: Automatic detection pipeline
|
||||
2. **VRAM Optimization**: Memory management for RTX 3080
|
||||
3. **Basic Matting**: Single-resolution processing
|
||||
4. **Configuration System**: YAML-based parameter control
|
||||
|
||||
### Phase 2: VR180 Optimization
|
||||
1. **Disparity Processing**: Stereo-aware matting
|
||||
2. **Multi-Resolution**: Scaling and upsampling pipeline
|
||||
3. **Quality Assessment**: Automated metrics and visualization
|
||||
4. **Edge Refinement**: Anti-aliasing and boundary smoothing
|
||||
|
||||
### Phase 3: Production Ready
|
||||
1. **Cloud GPU Support**: Docker containerization
|
||||
2. **Batch Processing**: Multiple video queue system
|
||||
3. **Performance Profiling**: Detailed resource usage analytics
|
||||
4. **Quality Validation**: Comprehensive testing suite
|
||||
|
||||
## Success Criteria
|
||||
|
||||
### Technical Feasibility
|
||||
- [ ] Process 30s VR180 clip without manual intervention
|
||||
- [ ] Maintain <10GB VRAM usage on RTX 3080
|
||||
- [ ] Achieve acceptable matting quality at 50% scale
|
||||
- [ ] Complete processing in <15 minutes locally
|
||||
|
||||
### Quality Benchmarks
|
||||
- [ ] Clean edges with minimal artifacts
|
||||
- [ ] Temporal consistency across frames
|
||||
- [ ] Stereo alignment between left/right eyes
|
||||
- [ ] Usable results for green screen compositing
|
||||
|
||||
### Scalability Validation
|
||||
- [ ] Configuration-driven parameter control
|
||||
- [ ] Clear performance vs quality tradeoffs identified
|
||||
- [ ] Docker deployment pathway established
|
||||
- [ ] Cost/benefit analysis for cloud GPU usage
|
||||
|
||||
## Risk Mitigation
|
||||
|
||||
### VRAM Limitations
|
||||
- **Fallback**: Automatic chunking with overlap processing
|
||||
- **Monitoring**: Real-time VRAM usage tracking
|
||||
- **Graceful Degradation**: Quality reduction before failure
|
||||
|
||||
### Quality Issues
|
||||
- **Validation Pipeline**: Automated quality assessment
|
||||
- **Manual Override**: Optional bounding box adjustment
|
||||
- **Fallback Methods**: Integration points for RVM if needed
|
||||
|
||||
### Performance Bottlenecks
|
||||
- **Profiling**: Detailed timing analysis per component
|
||||
- **Optimization**: Identify CPU vs GPU bound operations
|
||||
- **Scaling Strategy**: Clear upgrade path to cloud GPUs
|
||||
Reference in New Issue
Block a user