# YOLO + SAM2 VR180 Video Processing Pipeline - LLM Guide
## Project Overview
This repository implements an automated video processing pipeline specifically designed for **VR180 side-by-side stereo videos**. The system detects and segments humans in video content, replacing backgrounds with green screen for post-production compositing. The pipeline is optimized for long VR videos by splitting them into manageable segments, processing each segment independently, and then reassembling the final output.
## Core Purpose
The primary goal is to automatically create green screen videos from VR180 content where:
- **Left eye view** (left half of frame) contains humans as Object 1 (green masks)
- **Right eye view** (right half of frame) contains humans as Object 2 (blue masks)
- Background is replaced with pure green (RGB: 0,255,0) for chroma keying
- Original audio is preserved throughout the process
- Processing handles videos of any length through segmentation
## Architecture Overview
### Pipeline Stages
1. **Video Segmentation** (`core/video_splitter.py`)
- Splits long videos into 5-second segments using FFmpeg
- Creates organized directory structure: `segment_0/`, `segment_1/`, etc.
- Preserves timestamps and forces keyframes for clean cuts (see the splitting sketch after this list)
2. **Human Detection** (`core/yolo_detector.py`)
- Uses YOLOv8 for robust human detection in VR180 format
- Supports both detection mode (bounding boxes) and segmentation mode (direct masks)
- Automatically assigns humans to left/right eye based on position in frame
- Saves detection results for reuse and debugging
3. **Mask Generation** (`core/sam2_processor.py`)
- Uses Meta's SAM2 (Segment Anything Model 2) for precise segmentation
- Propagates masks across all frames in each segment
- Supports mask continuity between segments using previous segment's final masks
- Handles VR180 stereo tracking with separate object IDs for each eye
4. **Green Screen Processing** (`core/mask_processor.py`)
- Applies generated masks to isolate humans
- Replaces background with green screen
- Uses GPU acceleration (CuPy) for fast processing
- Maintains original video quality and framerate
5. **Video Assembly** (`core/video_assembler.py`)
- Concatenates all processed segments into final video
- Preserves original audio track from input video
- Uses hardware encoding (NVENC) when available
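A minimal sketch of stage 1's split, assuming FFmpeg is on PATH and invoked via subprocess as `video_splitter.py` presumably does; the exact flags and the per-segment directory layout in the real module may differ:
```python
import subprocess
from pathlib import Path

def split_video(video_path: str, work_dir: str, segment_duration: int = 5) -> None:
    """Split a video into fixed-length segments with keyframes forced at cut points."""
    out = Path(work_dir)
    out.mkdir(parents=True, exist_ok=True)
    subprocess.run([
        "ffmpeg", "-y", "-i", video_path,
        # Force a keyframe at every segment boundary so each cut is clean.
        "-force_key_frames", f"expr:gte(t,n_forced*{segment_duration})",
        "-f", "segment",
        "-segment_time", str(segment_duration),
        "-reset_timestamps", "1",
        "-c:v", "libx264", "-an",  # re-encode video; audio is restored at assembly
        str(out / "segment_%d.mp4"),
    ], check=True)
```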
### Key Components
```
samyolo_on_segments/
├── main.py                  # Entry point - orchestrates the pipeline
├── config.yaml              # Configuration file (YAML format)
├── core/                    # Core processing modules
│   ├── config_loader.py     # Configuration management
│   ├── video_splitter.py    # FFmpeg-based video segmentation
│   ├── yolo_detector.py     # YOLO human detection (detection/segmentation modes)
│   ├── sam2_processor.py    # SAM2 mask generation and propagation
│   ├── mask_processor.py    # Green screen application
│   └── video_assembler.py   # Final video concatenation
├── utils/                   # Utility functions
│   ├── file_utils.py        # File system operations
│   ├── logging_utils.py     # Logging configuration
│   └── status_utils.py      # Progress monitoring
└── models/                  # Model storage (created by download_models.py)
    ├── sam2/                # SAM2 checkpoints and configs
    └── yolo/                # YOLO model weights
```
## VR180 Specific Features
### Stereo Video Handling
- Automatically detects humans in left and right eye views
- Assigns Object ID 1 to left eye humans (green masks)
- Assigns Object ID 2 to right eye humans (blue masks)
- Maintains stereo correspondence throughout segments
### Frame Division Logic
- Frame width is divided in half to separate left/right views
- Human detection centers are used to determine eye assignment
- If only one human is detected, it may be duplicated to both eyes (configurable)
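A minimal sketch of that assignment rule, assuming detections arrive as `(x1, y1, x2, y2)` boxes in full-frame pixel coordinates (the function name is hypothetical):
```python
def assign_eye(box: tuple[float, float, float, float], frame_width: int) -> int:
    """Map a detection to SAM2 object ID 1 (left eye) or 2 (right eye),
    based on which half of the side-by-side frame its center falls in."""
    x1, _, x2, _ = box
    center_x = (x1 + x2) / 2
    return 1 if center_x < frame_width / 2 else 2
```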
## Configuration System
The pipeline is controlled via `config.yaml` with these key sections:
### Essential Settings
```yaml
input:
  video_path: "/path/to/vr180_video.mp4"
output:
  directory: "/path/to/output/"
  filename: "greenscreen_output.mp4"
processing:
  segment_duration: 5      # Seconds per segment
  inference_scale: 0.5     # Scale for faster processing
  yolo_confidence: 0.6     # Detection threshold
  detect_segments: "all"   # Which segments to process
models:
  yolo_model: "models/yolo/yolov8n.pt"
  sam2_checkpoint: "models/sam2/checkpoints/sam2.1_hiera_large.pt"
  sam2_config: "models/sam2/configs/sam2.1/sam2.1_hiera_l.yaml"
```
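`ConfigLoader` presumably wraps little more than a YAML load; a sketch with PyYAML (the exact class API is an assumption):
```python
import yaml

with open("config.yaml") as f:
    cfg = yaml.safe_load(f)

segment_duration = cfg["processing"]["segment_duration"]  # 5
yolo_model_path = cfg["models"]["yolo_model"]             # "models/yolo/yolov8n.pt"
```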
### Advanced Options
- **YOLO Modes**: Switch between detection (bboxes) and segmentation (direct masks)
- **Mid-segment Detection**: Re-detect humans at intervals within segments
- **Mask Quality**: Temporal smoothing, morphological operations, edge refinement
- **Debug Outputs**: Save detection visualizations and first-frame masks
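For the mask-quality options, typical morphological cleanup looks like the following OpenCV sketch (kernel size and the open/close ordering are illustrative assumptions, not the module's exact settings):
```python
import cv2
import numpy as np

def clean_mask(mask: np.ndarray, kernel_size: int = 5) -> np.ndarray:
    """Remove speckle noise, then fill small holes in a binary uint8 mask."""
    kernel = np.ones((kernel_size, kernel_size), np.uint8)
    mask = cv2.morphologyEx(mask, cv2.MORPH_OPEN, kernel)   # drop isolated specks
    mask = cv2.morphologyEx(mask, cv2.MORPH_CLOSE, kernel)  # close small gaps
    return mask
```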
## Processing Flow
### For First Segment (segment_0):
1. Load first frame at inference scale
2. Run YOLO to detect humans
3. Convert detections to SAM2 prompts (or use YOLO masks directly)
4. Initialize SAM2 with prompts/masks
5. Propagate masks through all frames
6. Apply green screen and save output
7. Save final mask for next segment
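Steps 3-5 map onto SAM2's public video-predictor API; a hedged sketch (SAM2 resolves configs through Hydra and expects a directory of extracted frames, and the wrapper in `sam2_processor.py` may differ in detail):
```python
import torch
from sam2.build_sam import build_sam2_video_predictor

predictor = build_sam2_video_predictor(
    "configs/sam2.1/sam2.1_hiera_l.yaml",             # Hydra config name
    "models/sam2/checkpoints/sam2.1_hiera_large.pt",
)
with torch.inference_mode():
    state = predictor.init_state(video_path="segments/segment_0/frames")
    # One box prompt per eye: obj_id 1 = left-eye human, obj_id 2 = right-eye human.
    # left_eye_box / right_eye_box are hypothetical outputs of the YOLO stage.
    for obj_id, box in [(1, left_eye_box), (2, right_eye_box)]:
        predictor.add_new_points_or_box(
            inference_state=state, frame_idx=0, obj_id=obj_id, box=box
        )
    # Propagate the prompts through every frame of the segment.
    for frame_idx, obj_ids, mask_logits in predictor.propagate_in_video(state):
        masks = (mask_logits > 0.0).cpu().numpy()  # boolean mask per object
```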
### For Subsequent Segments:
1. Check if YOLO detection is requested for this segment
2. If yes: Use YOLO detection (same as first segment)
3. If no: Load previous segment's final mask
4. Initialize SAM2 with previous masks
5. Continue propagation through segment
6. Apply green screen and save output
### Fallback Logic:
- If no previous mask exists, searches backwards through segments
- First segment always requires YOLO detection
- Missing detections can be recovered in later segments
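A minimal sketch of that backward search, using the `mask.png` continuity file described under Debug Outputs below (the helper name is hypothetical):
```python
from pathlib import Path
from typing import Optional

def find_previous_mask(segments_dir: str, segment_idx: int) -> Optional[Path]:
    """Walk backwards from the previous segment until a final mask is found."""
    for idx in range(segment_idx - 1, -1, -1):
        candidate = Path(segments_dir) / f"segment_{idx}" / "mask.png"
        if candidate.exists():
            return candidate
    return None  # no mask in any earlier segment; caller falls back to YOLO
```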
## Model Support
### YOLO Models
- **Detection**: yolov8n.pt, yolov8s.pt, yolov8m.pt (bounding boxes only)
- **Segmentation**: yolov8n-seg.pt, yolov8s-seg.pt (direct mask output)
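Running either family through Ultralytics looks the same; a sketch restricted to the COCO `person` class (class 0), with the confidence value from the config example above:
```python
from ultralytics import YOLO

model = YOLO("models/yolo/yolov8n.pt")
# classes=[0] keeps only COCO class 0 ("person"); conf matches yolo_confidence.
results = model(frame, conf=0.6, classes=[0], verbose=False)
boxes = results[0].boxes.xyxy.cpu().numpy()  # (N, 4) person boxes, xyxy pixels
# The -seg variants additionally expose per-person masks via results[0].masks.
```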
### SAM2 Models
- **Tiny**: sam2.1_hiera_tiny.pt (fastest, lowest quality)
- **Small**: sam2.1_hiera_small.pt
- **Base+**: sam2.1_hiera_base_plus.pt
- **Large**: sam2.1_hiera_large.pt (best quality, slowest)
## Key Implementation Details
### GPU Optimization
- CUDA device selection with MPS fallback
- CuPy for GPU-accelerated mask operations
- NVENC hardware encoding support
- Batch processing where possible
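The CuPy mask operation at the heart of green screen application reduces to a single broadcasted `where`; a sketch assuming a boolean human mask at frame resolution:
```python
import cupy as cp

GREEN = cp.array([0, 255, 0], dtype=cp.uint8)  # pure green background (RGB)

def apply_green_screen(frame: cp.ndarray, human_mask: cp.ndarray) -> cp.ndarray:
    """Keep human pixels, replace everything else with chroma-key green.
    frame: (H, W, 3) uint8 image on the GPU; human_mask: (H, W) boolean."""
    return cp.where(human_mask[:, :, None], frame, GREEN)
```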
### Memory Management
- Segments processed sequentially to limit memory usage
- Explicit garbage collection between segments
- Low-resolution inference with high-resolution rendering
- Configurable scale factors for different stages
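The between-segment cleanup is typically just explicit collection plus a CUDA cache flush; a minimal sketch:
```python
import gc
import torch

def release_segment_memory() -> None:
    """Call between segments so long videos don't accumulate GPU/host memory."""
    gc.collect()
    if torch.cuda.is_available():
        torch.cuda.empty_cache()
```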
### Error Handling
- Graceful fallback when masks are unavailable
- Segment-level recovery (can restart individual segments)
- Comprehensive logging at all stages
- Status checking and cleanup utilities
## Debugging Features
### Status Monitoring
```bash
python main.py --config config.yaml --status
```
### Segment Cleanup
```bash
python main.py --config config.yaml --cleanup-segment 5
```
### Debug Outputs
- `yolo_debug.jpg`: Bounding box visualizations
- `first_frame_detection.jpg`: Initial mask visualization
- `mask.png`: Final segment mask for continuity
- `yolo_detections`: Saved detection coordinates
## Common Issues and Solutions
### No Right Eye Detections in VR180
- Lower `yolo_confidence` threshold (try 0.3-0.4)
- Enable debug mode to analyze detection confidence
- Check if person is actually visible in right eye view
### Mask Propagation Failures
- Ensure first segment has successful YOLO detections
- Check previous segment's mask.png exists
- Consider re-running YOLO on problem segments
### Memory Issues
- Reduce `inference_scale` (try 0.25)
- Use smaller models (tiny/small variants)
- Process fewer segments at once
## Development Notes
### Adding Features
- All core modules inherit from base classes in `core/`
- Configuration is centralized through `ConfigLoader`
- Logging uses Python's standard logging module
- File operations go through `utils/file_utils.py`
### Testing Components
- Each module can be tested independently
- Use `--status` flag to check processing state
- Debug outputs help verify each stage
### Performance Tuning
- Adjust `inference_scale` for speed vs quality
- Use `detect_segments` to run YOLO detection only on selected segments
- Enable `use_nvenc` for hardware encoding
- Consider `vos_optimized` mode for SAM2 (experimental)
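For `use_nvenc`, assembly can hand encoding to the GPU while copying the original audio; a hedged sketch of the concat-and-mux command (the exact arguments in `video_assembler.py` may differ):
```python
import subprocess

def assemble(concat_list: str, source_video: str, output_path: str) -> None:
    """Concatenate processed segments and mux in the original audio track."""
    subprocess.run([
        "ffmpeg", "-y",
        "-f", "concat", "-safe", "0", "-i", concat_list,  # text file listing segments
        "-i", source_video,                               # original video, audio only
        "-map", "0:v:0", "-map", "1:a:0",
        "-c:v", "h264_nvenc",                             # NVENC hardware encoder
        "-c:a", "copy",                                   # preserve original audio
        output_path,
    ], check=True)
```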
## Original Monolithic Script
The project includes the original working script in `spec.md` (lines 200-811) as a reference implementation. This script works but processes videos monolithically. The current modular architecture maintains the same core logic while adding:
- Better error handling and recovery
- Configurable processing pipeline
- Debug and monitoring capabilities
- Cleaner code organization