# YOLO + SAM2 VR180 Video Processing Pipeline - LLM Guide

## Project Overview

This repository implements an automated video processing pipeline specifically designed for **VR180 side-by-side stereo videos**. The system detects and segments humans in video content, replacing backgrounds with green screen for post-production compositing. The pipeline is optimized for long VR videos by splitting them into manageable segments, processing each segment independently, and then reassembling the final output.

## Core Purpose

The primary goal is to automatically create green screen videos from VR180 content where:

- **Left eye view** (left half of frame) contains humans as Object 1 (green masks)
- **Right eye view** (right half of frame) contains humans as Object 2 (blue masks)
- Background is replaced with pure green (RGB: 0,255,0) for chroma keying
- Original audio is preserved throughout the process
- Processing handles videos of any length through segmentation

## Architecture Overview

### Pipeline Stages

1. **Video Segmentation** (`core/video_splitter.py`)
   - Splits long videos into 5-second segments using FFmpeg
   - Creates organized directory structure: `segment_0/`, `segment_1/`, etc.
   - Preserves timestamps and forces keyframes for clean cuts

2. **Human Detection** (`core/yolo_detector.py`)
   - Uses YOLOv8 for robust human detection in VR180 format
   - Supports both detection mode (bounding boxes) and segmentation mode (direct masks)
   - Automatically assigns humans to left/right eye based on position in frame
   - Saves detection results for reuse and debugging

3. **Mask Generation** (`core/sam2_processor.py`)
   - Uses Meta's SAM2 (Segment Anything Model 2) for precise segmentation
   - Propagates masks across all frames in each segment
   - Supports mask continuity between segments using previous segment's final masks
   - Handles VR180 stereo tracking with separate object IDs for each eye

4. **Green Screen Processing** (`core/mask_processor.py`)
   - Applies generated masks to isolate humans
   - Replaces background with green screen
   - Uses GPU acceleration (CuPy) for fast processing
   - Maintains original video quality and framerate

5. **Video Assembly** (`core/video_assembler.py`)
   - Concatenates all processed segments into final video
   - Preserves original audio track from input video
   - Uses hardware encoding (NVENC) when available

### Key Components

```
samyolo_on_segments/
├── main.py                 # Entry point - orchestrates the pipeline
├── config.yaml             # Configuration file (YAML format)
├── core/                   # Core processing modules
│   ├── config_loader.py    # Configuration management
│   ├── video_splitter.py   # FFmpeg-based video segmentation
│   ├── yolo_detector.py    # YOLO human detection (detection/segmentation modes)
│   ├── sam2_processor.py   # SAM2 mask generation and propagation
│   ├── mask_processor.py   # Green screen application
│   └── video_assembler.py  # Final video concatenation
├── utils/                  # Utility functions
│   ├── file_utils.py       # File system operations
│   ├── logging_utils.py    # Logging configuration
│   └── status_utils.py     # Progress monitoring
└── models/                 # Model storage (created by download_models.py)
    ├── sam2/               # SAM2 checkpoints and configs
    └── yolo/               # YOLO model weights
```

## VR180 Specific Features

### Stereo Video Handling

- Automatically detects humans in left and right eye views
- Assigns Object ID 1 to left eye humans (green masks)
- Assigns Object ID 2 to right eye humans (blue masks)
- Maintains stereo correspondence throughout segments

### Frame Division Logic

- Frame width is divided in half to separate left/right views
- Human detection centers are used to determine eye assignment
- If only one human is detected, it may be duplicated to both eyes (configurable)
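To make the frame division concrete, here is a minimal sketch of the eye-assignment heuristic described above. It is illustrative only: the helper name and the plain-tuple box format are assumptions, not the actual `core/yolo_detector.py` interface.

```python
# Illustrative sketch of VR180 eye assignment (hypothetical helper, not the
# actual core/yolo_detector.py API): each person box's horizontal center
# decides which half of the side-by-side frame it belongs to, and therefore
# which SAM2 object ID it is tracked under.
from typing import Dict, List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)

LEFT_EYE_OBJECT_ID = 1   # left half of the frame (green masks)
RIGHT_EYE_OBJECT_ID = 2  # right half of the frame (blue masks)


def assign_boxes_to_eyes(person_boxes: List[Box], frame_width: int) -> Dict[int, List[Box]]:
    """Group detected person boxes by eye view using their horizontal centers."""
    half_width = frame_width / 2
    assignments: Dict[int, List[Box]] = {LEFT_EYE_OBJECT_ID: [], RIGHT_EYE_OBJECT_ID: []}
    for x1, y1, x2, y2 in person_boxes:
        center_x = (x1 + x2) / 2
        eye_id = LEFT_EYE_OBJECT_ID if center_x < half_width else RIGHT_EYE_OBJECT_ID
        assignments[eye_id].append((x1, y1, x2, y2))
    return assignments


if __name__ == "__main__":
    # A 5760-pixel-wide side-by-side frame with one person visible in each eye view.
    boxes = [(400.0, 200.0, 900.0, 1800.0), (3400.0, 210.0, 3900.0, 1810.0)]
    print(assign_boxes_to_eyes(boxes, frame_width=5760))
```

The configurable "duplicate to both eyes" behaviour mentioned above would slot in after this grouping, copying a lone detection into whichever eye's list came back empty.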
## Configuration System

The pipeline is controlled via `config.yaml` with these key sections:

### Essential Settings

```yaml
input:
  video_path: "/path/to/vr180_video.mp4"

output:
  directory: "/path/to/output/"
  filename: "greenscreen_output.mp4"

processing:
  segment_duration: 5      # Seconds per segment
  inference_scale: 0.5     # Scale for faster processing
  yolo_confidence: 0.6     # Detection threshold
  detect_segments: "all"   # Which segments to process

models:
  yolo_model: "models/yolo/yolov8n.pt"
  sam2_checkpoint: "models/sam2/checkpoints/sam2.1_hiera_large.pt"
  sam2_config: "models/sam2/configs/sam2.1/sam2.1_hiera_l.yaml"
```

### Advanced Options

- **YOLO Modes**: Switch between detection (bboxes) and segmentation (direct masks)
- **Mid-segment Detection**: Re-detect humans at intervals within segments
- **Mask Quality**: Temporal smoothing, morphological operations, edge refinement
- **Debug Outputs**: Save detection visualizations and first-frame masks

## Processing Flow

### For First Segment (segment_0):

1. Load first frame at inference scale
2. Run YOLO to detect humans
3. Convert detections to SAM2 prompts (or use YOLO masks directly)
4. Initialize SAM2 with prompts/masks
5. Propagate masks through all frames
6. Apply green screen and save output
7. Save final mask for next segment

### For Subsequent Segments:

1. Check if YOLO detection is requested for this segment
2. If yes: Use YOLO detection (same as first segment)
3. If no: Load previous segment's final mask
4. Initialize SAM2 with previous masks
5. Continue propagation through segment
6. Apply green screen and save output

### Fallback Logic:

- If no previous mask exists, searches backwards through segments
- First segment always requires YOLO detection
- Missing detections can be recovered in later segments
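A minimal sketch of this per-segment decision and the backward mask search is shown below. Only the `segment_N/mask.png` layout and the `detect_segments: "all"` setting come from this guide; the function names and signatures are illustrative, not the actual module API.

```python
# Sketch of choosing between fresh YOLO prompts and a previous segment's mask
# (hypothetical helpers; only the segment_N/mask.png layout is from this guide).
from pathlib import Path
from typing import Optional, Sequence, Union


def find_previous_mask(segments_root: Path, segment_index: int) -> Optional[Path]:
    """Search backwards for the most recent segment that saved a final mask."""
    for prior in range(segment_index - 1, -1, -1):
        candidate = segments_root / f"segment_{prior}" / "mask.png"
        if candidate.exists():
            return candidate
    return None


def needs_yolo_detection(
    segment_index: int,
    detect_segments: Union[str, Sequence[int]],
    segments_root: Path,
) -> bool:
    """Decide whether this segment starts from YOLO detections or a prior mask."""
    if segment_index == 0:
        return True  # the first segment always requires YOLO detection
    if detect_segments == "all" or segment_index in detect_segments:
        return True  # detection explicitly requested for this segment
    # Otherwise run YOLO only if no earlier segment left a usable mask behind.
    return find_previous_mask(segments_root, segment_index) is None


if __name__ == "__main__":
    print(needs_yolo_detection(3, detect_segments=[0, 10], segments_root=Path("segments")))
```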
## Model Support

### YOLO Models

- **Detection**: yolov8n.pt, yolov8s.pt, yolov8m.pt (bounding boxes only)
- **Segmentation**: yolov8n-seg.pt, yolov8s-seg.pt (direct mask output)

### SAM2 Models

- **Tiny**: sam2.1_hiera_tiny.pt (fastest, lowest quality)
- **Small**: sam2.1_hiera_small.pt
- **Base+**: sam2.1_hiera_base_plus.pt
- **Large**: sam2.1_hiera_large.pt (best quality, slowest)

## Key Implementation Details

### GPU Optimization

- CUDA device selection with MPS fallback
- CuPy for GPU-accelerated mask operations
- NVENC hardware encoding support
- Batch processing where possible

### Memory Management

- Segments processed sequentially to limit memory usage
- Explicit garbage collection between segments
- Low-resolution inference with high-resolution rendering
- Configurable scale factors for different stages

### Error Handling

- Graceful fallback when masks are unavailable
- Segment-level recovery (can restart individual segments)
- Comprehensive logging at all stages
- Status checking and cleanup utilities

## Debugging Features

### Status Monitoring

```bash
python main.py --config config.yaml --status
```

### Segment Cleanup

```bash
python main.py --config config.yaml --cleanup-segment 5
```

### Debug Outputs

- `yolo_debug.jpg`: Bounding box visualizations
- `first_frame_detection.jpg`: Initial mask visualization
- `mask.png`: Final segment mask for continuity
- `yolo_detections`: Saved detection coordinates

## Common Issues and Solutions

### No Right Eye Detections in VR180

- Lower `yolo_confidence` threshold (try 0.3-0.4)
- Enable debug mode to analyze detection confidence
- Check if the person is actually visible in the right eye view

### Mask Propagation Failures

- Ensure the first segment has successful YOLO detections
- Check that the previous segment's mask.png exists
- Consider re-running YOLO on problem segments

### Memory Issues

- Reduce `inference_scale` (try 0.25)
- Use smaller models (tiny/small variants)
- Process fewer segments at once

## Development Notes

### Adding Features

- All core modules inherit from base classes in `core/`
- Configuration is centralized through `ConfigLoader`
- Logging uses Python's standard logging module
- File operations go through `utils/file_utils.py`

### Testing Components

- Each module can be tested independently
- Use the `--status` flag to check processing state
- Debug outputs help verify each stage

### Performance Tuning

- Adjust `inference_scale` for speed vs quality
- Use `detect_segments` to process only key frames
- Enable `use_nvenc` for hardware encoding
- Consider `vos_optimized` mode for SAM2 (experimental)

## Original Monolithic Script

The project includes the original working script in `spec.md` (lines 200-811) as a reference implementation. This script works but processes videos monolithically. The current modular architecture maintains the same core logic while adding:

- Better error handling and recovery
- Configurable processing pipeline
- Debug and monitoring capabilities
- Cleaner code organization
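For reference, the core compositing operation performed in the Green Screen Processing stage can be sketched as follows. This uses NumPy for clarity; the pipeline itself performs this step with CuPy on the GPU, and the function shown here is illustrative rather than the actual `core/mask_processor.py` implementation.

```python
# Minimal sketch of green-screen compositing: keep human pixels identified by
# the mask and replace everything else with pure green (0, 255, 0). NumPy is
# used for illustration; the pipeline uses CuPy for GPU acceleration.
import numpy as np


def apply_green_screen(frame_rgb: np.ndarray, human_mask: np.ndarray) -> np.ndarray:
    """Composite an RGB frame over a chroma-key green background using a boolean mask."""
    output = np.empty_like(frame_rgb)
    output[:] = (0, 255, 0)                      # pure green background
    output[human_mask] = frame_rgb[human_mask]   # copy human pixels through
    return output


if __name__ == "__main__":
    frame = np.random.randint(0, 256, (1080, 1920, 3), dtype=np.uint8)
    mask = np.zeros((1080, 1920), dtype=bool)
    mask[200:900, 300:800] = True                # pretend this region is a person
    print(apply_green_screen(frame, mask).shape)
```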