# YOLO + SAM2 VR180 Video Processing Pipeline - LLM Guide

## Project Overview

This repository implements an automated video processing pipeline specifically designed for **VR180 side-by-side stereo videos**. The system detects and segments humans in video content, replacing backgrounds with green screen for post-production compositing. The pipeline is optimized for long VR videos by splitting them into manageable segments, processing each segment independently, and then reassembling the final output.

## Core Purpose

The primary goal is to automatically create green screen videos from VR180 content where:

- **Left eye view** (left half of frame) contains humans as Object 1 (green masks)
- **Right eye view** (right half of frame) contains humans as Object 2 (blue masks)
- Background is replaced with pure green (RGB: 0, 255, 0) for chroma keying
- Original audio is preserved throughout the process
- Processing handles videos of any length through segmentation
## Architecture Overview

### Pipeline Stages

1. **Video Segmentation** (`core/video_splitter.py`)
   - Splits long videos into 5-second segments using FFmpeg
   - Creates organized directory structure: `segment_0/`, `segment_1/`, etc.
   - Preserves timestamps and forces keyframes for clean cuts

2. **Human Detection** (`core/yolo_detector.py`)
   - Uses YOLOv8 for robust human detection in VR180 format
   - Supports both detection mode (bounding boxes) and segmentation mode (direct masks)
   - Automatically assigns humans to left/right eye based on position in frame
   - Saves detection results for reuse and debugging

3. **Mask Generation** (`core/sam2_processor.py`)
   - Uses Meta's SAM2 (Segment Anything Model 2) for precise segmentation
   - Propagates masks across all frames in each segment
   - Supports mask continuity between segments using previous segment's final masks
   - Handles VR180 stereo tracking with separate object IDs for each eye

4. **Green Screen Processing** (`core/mask_processor.py`)
   - Applies generated masks to isolate humans
   - Replaces background with green screen
   - Uses GPU acceleration (CuPy) for fast processing
   - Maintains original video quality and framerate

5. **Video Assembly** (`core/video_assembler.py`)
   - Concatenates all processed segments into final video
   - Preserves original audio track from input video
   - Uses hardware encoding (NVENC) when available
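
The segmentation stage can be sketched as an FFmpeg invocation. The helper name `build_split_command` and the flat `segment_%d.mp4` output pattern below are illustrative assumptions, not the actual `video_splitter.py` interface (which writes per-segment directories):

```python
import subprocess  # only needed if you actually run the command

def build_split_command(input_path: str, out_pattern: str, seg_seconds: int = 5) -> list[str]:
    """Build an FFmpeg command that splits a video into fixed-length segments.

    Forces a keyframe at each segment boundary so every cut is clean,
    and resets timestamps so each segment starts at t=0.
    """
    return [
        "ffmpeg", "-i", input_path,
        # Force keyframes exactly at segment boundaries for clean cuts.
        "-force_key_frames", f"expr:gte(t,n_forced*{seg_seconds})",
        "-f", "segment",
        "-segment_time", str(seg_seconds),
        "-reset_timestamps", "1",
        "-c:v", "libx264",  # re-encode so the forced keyframes take effect
        out_pattern,
    ]

cmd = build_split_command("/path/to/vr180_video.mp4", "segment_%d.mp4")
# subprocess.run(cmd, check=True)  # uncomment to actually split
```

Re-encoding (rather than `-c copy`) is what allows keyframes to be forced at exact boundaries.
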
### Key Components

```
samyolo_on_segments/
├── main.py                  # Entry point - orchestrates the pipeline
├── config.yaml              # Configuration file (YAML format)
├── core/                    # Core processing modules
│   ├── config_loader.py     # Configuration management
│   ├── video_splitter.py    # FFmpeg-based video segmentation
│   ├── yolo_detector.py     # YOLO human detection (detection/segmentation modes)
│   ├── sam2_processor.py    # SAM2 mask generation and propagation
│   ├── mask_processor.py    # Green screen application
│   └── video_assembler.py   # Final video concatenation
├── utils/                   # Utility functions
│   ├── file_utils.py        # File system operations
│   ├── logging_utils.py     # Logging configuration
│   └── status_utils.py      # Progress monitoring
└── models/                  # Model storage (created by download_models.py)
    ├── sam2/                # SAM2 checkpoints and configs
    └── yolo/                # YOLO model weights
```

## VR180 Specific Features

### Stereo Video Handling

- Automatically detects humans in left and right eye views
- Assigns Object ID 1 to left eye humans (green masks)
- Assigns Object ID 2 to right eye humans (blue masks)
- Maintains stereo correspondence throughout segments

### Frame Division Logic

- Frame width is divided in half to separate left/right views
- Human detection centers are used to determine eye assignment
- If only one human is detected, it may be duplicated to both eyes (configurable)
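
The eye-assignment rule above can be sketched as a small helper. The function name and return convention are illustrative, not the actual `yolo_detector.py` API:

```python
def assign_eye(bbox, frame_width):
    """Assign a detection to the left or right eye view of a VR180 frame.

    bbox is (x1, y1, x2, y2) in pixels. Returns the SAM2 object ID:
    1 for the left eye (green mask), 2 for the right eye (blue mask).
    """
    x1, _, x2, _ = bbox
    center_x = (x1 + x2) / 2
    # The frame is split in half: left half = left eye, right half = right eye.
    return 1 if center_x < frame_width / 2 else 2

# A person centred at x=900 in a 3840-wide frame falls in the left eye view;
# the stereo copy half a frame to the right falls in the right eye view.
assert assign_eye((800, 200, 1000, 900), 3840) == 1
assert assign_eye((2720, 200, 2920, 900), 3840) == 2
```
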
## Configuration System

The pipeline is controlled via `config.yaml` with these key sections:

### Essential Settings

```yaml
input:
  video_path: "/path/to/vr180_video.mp4"

output:
  directory: "/path/to/output/"
  filename: "greenscreen_output.mp4"

processing:
  segment_duration: 5      # Seconds per segment
  inference_scale: 0.5     # Scale for faster processing
  yolo_confidence: 0.6     # Detection threshold
  detect_segments: "all"   # Which segments to process

models:
  yolo_model: "models/yolo/yolov8n.pt"
  sam2_checkpoint: "models/sam2/checkpoints/sam2.1_hiera_large.pt"
  sam2_config: "models/sam2/configs/sam2.1/sam2.1_hiera_l.yaml"
```
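
A minimal sketch of how a loader might layer user settings over built-in defaults; the real `ConfigLoader` interface is not shown in this guide, so the function name and the `DEFAULTS` values below are assumptions taken from the sample config:

```python
DEFAULTS = {
    "processing": {
        "segment_duration": 5,
        "inference_scale": 0.5,
        "yolo_confidence": 0.6,
        "detect_segments": "all",
    },
}

def merge_config(defaults: dict, user: dict) -> dict:
    """Recursively overlay user-supplied values on top of the defaults."""
    merged = dict(defaults)
    for key, value in user.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_config(merged[key], value)
        else:
            merged[key] = value
    return merged

# A config.yaml that only overrides the confidence keeps all other defaults.
cfg = merge_config(DEFAULTS, {"processing": {"yolo_confidence": 0.4}})
```
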
### Advanced Options

- **YOLO Modes**: Switch between detection (bboxes) and segmentation (direct masks)
- **Mid-segment Detection**: Re-detect humans at intervals within segments
- **Mask Quality**: Temporal smoothing, morphological operations, edge refinement
- **Debug Outputs**: Save detection visualizations and first-frame masks
## Processing Flow

### For First Segment (segment_0):

1. Load first frame at inference scale
2. Run YOLO to detect humans
3. Convert detections to SAM2 prompts (or use YOLO masks directly)
4. Initialize SAM2 with prompts/masks
5. Propagate masks through all frames
6. Apply green screen and save output
7. Save final mask for next segment

### For Subsequent Segments:

1. Check if YOLO detection is requested for this segment
2. If yes: run YOLO detection (same as first segment)
3. If no: load previous segment's final mask
4. Initialize SAM2 with previous masks
5. Continue propagation through segment
6. Apply green screen and save output

### Fallback Logic:

- If no previous mask exists, the pipeline searches backwards through earlier segments
- First segment always requires YOLO detection
- Missing detections can be recovered in later segments
## Model Support

### YOLO Models

- **Detection**: yolov8n.pt, yolov8s.pt, yolov8m.pt (bounding boxes only)
- **Segmentation**: yolov8n-seg.pt, yolov8s-seg.pt (direct mask output)

### SAM2 Models

- **Tiny**: sam2.1_hiera_tiny.pt (fastest, lowest quality)
- **Small**: sam2.1_hiera_small.pt
- **Base+**: sam2.1_hiera_base_plus.pt
- **Large**: sam2.1_hiera_large.pt (best quality, slowest)
## Key Implementation Details

### GPU Optimization

- CUDA device selection with MPS fallback
- CuPy for GPU-accelerated mask operations
- NVENC hardware encoding support
- Batch processing where possible
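
The green screen application itself is a simple array operation. Here is a NumPy sketch; the real `mask_processor.py` performs the equivalent on the GPU with CuPy (whose API mirrors NumPy), and the function name below is an assumption:

```python
import numpy as np

GREEN = np.array([0, 255, 0], dtype=np.uint8)  # pure green for chroma keying

def apply_green_screen(frame: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Keep human pixels where the mask is True; paint everything else green.

    frame: (H, W, 3) uint8 image; mask: (H, W) boolean human mask.
    """
    # mask[..., None] broadcasts the (H, W) mask across the 3 color channels.
    return np.where(mask[..., None], frame, GREEN).astype(np.uint8)
```

Swapping `numpy` for `cupy` keeps the same code path while moving the work to the GPU, which is the kind of interchange the CuPy-based optimization relies on.
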
### Memory Management

- Segments processed sequentially to limit memory usage
- Explicit garbage collection between segments
- Low-resolution inference with high-resolution rendering
- Configurable scale factors for different stages
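
Low-resolution inference with high-resolution rendering means coordinates found at `inference_scale` must be mapped back to full resolution before rendering. A sketch of the bounding-box case (the helper name is illustrative):

```python
def scale_bbox(bbox, inference_scale):
    """Map a bbox detected on a downscaled frame back to full resolution.

    bbox is (x1, y1, x2, y2) in inference-frame pixels; dividing by the
    scale factor recovers full-resolution coordinates.
    """
    return tuple(round(c / inference_scale) for c in bbox)

# At inference_scale 0.5, coordinates double when mapped back.
assert scale_bbox((100, 50, 200, 150), 0.5) == (200, 100, 400, 300)
```
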
### Error Handling

- Graceful fallback when masks are unavailable
- Segment-level recovery (can restart individual segments)
- Comprehensive logging at all stages
- Status checking and cleanup utilities
## Debugging Features

### Status Monitoring

```bash
python main.py --config config.yaml --status
```

### Segment Cleanup

```bash
python main.py --config config.yaml --cleanup-segment 5
```

### Debug Outputs

- `yolo_debug.jpg`: Bounding box visualizations
- `first_frame_detection.jpg`: Initial mask visualization
- `mask.png`: Final segment mask for continuity
- `yolo_detections`: Saved detection coordinates
## Common Issues and Solutions

### No Right Eye Detections in VR180

- Lower the `yolo_confidence` threshold (try 0.3-0.4)
- Enable debug mode to analyze detection confidence
- Check whether the person is actually visible in the right eye view

### Mask Propagation Failures

- Ensure the first segment has successful YOLO detections
- Check that the previous segment's `mask.png` exists
- Consider re-running YOLO on problem segments

### Memory Issues

- Reduce `inference_scale` (try 0.25)
- Use smaller models (tiny/small variants)
- Process fewer segments at once
## Development Notes

### Adding Features

- All core modules inherit from base classes in `core/`
- Configuration is centralized through `ConfigLoader`
- Logging uses Python's standard logging module
- File operations go through `utils/file_utils.py`

### Testing Components

- Each module can be tested independently
- Use the `--status` flag to check processing state
- Debug outputs help verify each stage

### Performance Tuning

- Adjust `inference_scale` for the speed vs. quality trade-off
- Use `detect_segments` to run YOLO detection only on selected segments
- Enable `use_nvenc` for hardware encoding
- Consider `vos_optimized` mode for SAM2 (experimental)
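
The per-segment detection decision driven by `detect_segments` might be implemented as below. Only the `"all"` value appears in the sample config; the list-of-indices form is an assumption for illustration, not documented behaviour:

```python
def detection_requested(detect_segments, segment_index):
    """Decide whether to run fresh YOLO detection on this segment.

    "all" means every segment; a list of indices (an assumed format)
    means only those segments. Segment 0 always needs detection,
    since there is no earlier mask to propagate from.
    """
    if segment_index == 0 or detect_segments == "all":
        return True
    return segment_index in detect_segments

assert detection_requested("all", 7)
assert detection_requested([3, 6], 6)
assert not detection_requested([3, 6], 5)
assert detection_requested([3, 6], 0)  # first segment always detects
```
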
## Original Monolithic Script

The project includes the original working script in `spec.md` (lines 200-811) as a reference implementation. This script works but processes videos monolithically. The current modular architecture maintains the same core logic while adding:

- Better error handling and recovery
- Configurable processing pipeline
- Debug and monitoring capabilities
- Cleaner code organization