# YOLO + SAM2 VR180 Video Processing Pipeline - LLM Guide

## Project Overview
This repository implements an automated video processing pipeline specifically designed for VR180 side-by-side stereo videos. The system detects and segments humans in video content, replacing backgrounds with green screen for post-production compositing. The pipeline is optimized for long VR videos by splitting them into manageable segments, processing each segment independently, and then reassembling the final output.
## Core Purpose
The primary goal is to automatically create green screen videos from VR180 content where:
- Left eye view (left half of frame) contains humans as Object 1 (green masks)
- Right eye view (right half of frame) contains humans as Object 2 (blue masks)
- Background is replaced with pure green (RGB: 0,255,0) for chroma keying
- Original audio is preserved throughout the process
- Processing handles videos of any length through segmentation
## Architecture Overview

### Pipeline Stages
1. **Video Segmentation** (`core/video_splitter.py`)
   - Splits long videos into 5-second segments using FFmpeg
   - Creates an organized directory structure: `segment_0/`, `segment_1/`, etc.
   - Preserves timestamps and forces keyframes for clean cuts

2. **Human Detection** (`core/yolo_detector.py`)
   - Uses YOLOv8 for robust human detection in VR180 format
   - Supports both detection mode (bounding boxes) and segmentation mode (direct masks)
   - Automatically assigns humans to the left or right eye based on their position in the frame
   - Saves detection results for reuse and debugging

3. **Mask Generation** (`core/sam2_processor.py`)
   - Uses Meta's SAM2 (Segment Anything Model 2) for precise segmentation
   - Propagates masks across all frames in each segment
   - Supports mask continuity between segments using the previous segment's final masks
   - Handles VR180 stereo tracking with separate object IDs for each eye

4. **Green Screen Processing** (`core/mask_processor.py`)
   - Applies the generated masks to isolate humans
   - Replaces the background with green screen (see the sketch after this list)
   - Uses GPU acceleration (CuPy) for fast processing
   - Maintains the original video quality and framerate

5. **Video Assembly** (`core/video_assembler.py`)
   - Concatenates all processed segments into the final video
   - Preserves the original audio track from the input video
   - Uses hardware encoding (NVENC) when available
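The compositing in stage 4 reduces to a per-frame masked copy. A minimal sketch, assuming boolean per-frame human masks and OpenCV-style BGR frames (illustrative, not the actual `core/mask_processor.py` implementation):

```python
import numpy as np

def apply_green_screen(frame_bgr: np.ndarray, human_mask: np.ndarray) -> np.ndarray:
    """Keep human pixels; replace everything else with pure green."""
    out = np.zeros_like(frame_bgr)
    out[..., 1] = 255                        # pure green is (0, 255, 0) in RGB and BGR alike
    out[human_mask] = frame_bgr[human_mask]  # copy only segmented human pixels
    return out
```

The pipeline is described as running this on the GPU via CuPy, whose NumPy-compatible API supports the same indexing pattern.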
### Key Components

```
samyolo_on_segments/
├── main.py                  # Entry point - orchestrates the pipeline
├── config.yaml              # Configuration file (YAML format)
├── core/                    # Core processing modules
│   ├── config_loader.py     # Configuration management
│   ├── video_splitter.py    # FFmpeg-based video segmentation
│   ├── yolo_detector.py     # YOLO human detection (detection/segmentation modes)
│   ├── sam2_processor.py    # SAM2 mask generation and propagation
│   ├── mask_processor.py    # Green screen application
│   └── video_assembler.py   # Final video concatenation
├── utils/                   # Utility functions
│   ├── file_utils.py        # File system operations
│   ├── logging_utils.py     # Logging configuration
│   └── status_utils.py      # Progress monitoring
└── models/                  # Model storage (created by download_models.py)
    ├── sam2/                # SAM2 checkpoints and configs
    └── yolo/                # YOLO model weights
```
## VR180-Specific Features

### Stereo Video Handling
- Automatically detects humans in left and right eye views
- Assigns Object ID 1 to left eye humans (green masks)
- Assigns Object ID 2 to right eye humans (blue masks)
- Maintains stereo correspondence throughout segments
### Frame Division Logic

- The frame width is divided in half to separate the left/right views
- Each detected human's bounding-box center determines its eye assignment (see the sketch below)
- If only one human is detected, it may be duplicated to both eyes (configurable)
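A minimal sketch of that assignment, assuming detections arrive as `(x1, y1, x2, y2)` pixel boxes; the function name is illustrative rather than the actual `core/yolo_detector.py` API:

```python
def assign_eye(bbox, frame_width: int) -> int:
    """Map a detection to a SAM2 object ID: 1 = left eye view, 2 = right eye view."""
    x1, _, x2, _ = bbox
    center_x = (x1 + x2) / 2
    # The left half of the side-by-side frame is the left eye view.
    return 1 if center_x < frame_width / 2 else 2
```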
## Configuration System

The pipeline is controlled via `config.yaml`, with these key sections:

### Essential Settings
```yaml
input:
  video_path: "/path/to/vr180_video.mp4"

output:
  directory: "/path/to/output/"
  filename: "greenscreen_output.mp4"

processing:
  segment_duration: 5     # Seconds per segment
  inference_scale: 0.5    # Scale for faster processing
  yolo_confidence: 0.6    # Detection threshold
  detect_segments: "all"  # Which segments to process

models:
  yolo_model: "models/yolo/yolov8n.pt"
  sam2_checkpoint: "models/sam2/checkpoints/sam2.1_hiera_large.pt"
  sam2_config: "models/sam2/configs/sam2.1/sam2.1_hiera_l.yaml"
```
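Reading this file takes only a few lines with PyYAML; a sketch of what the loading step might look like (the real `ConfigLoader` interface is not documented in this guide):

```python
import yaml

def load_config(path: str = "config.yaml") -> dict:
    """Parse the pipeline configuration into a plain dict."""
    with open(path) as f:
        return yaml.safe_load(f)

cfg = load_config()
print(cfg["processing"]["segment_duration"])  # -> 5
```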
### Advanced Options

- **YOLO Modes**: Switch between detection (bboxes) and segmentation (direct masks)
- **Mid-segment Detection**: Re-detect humans at intervals within segments
- **Mask Quality**: Temporal smoothing, morphological operations, edge refinement
- **Debug Outputs**: Save detection visualizations and first-frame masks
## Processing Flow

### For the First Segment (`segment_0`):

1. Load the first frame at inference scale
2. Run YOLO to detect humans
3. Convert detections to SAM2 prompts (or use YOLO masks directly)
4. Initialize SAM2 with the prompts/masks
5. Propagate masks through all frames (sketched below)
6. Apply the green screen and save the output
7. Save the final mask for the next segment
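A condensed sketch of steps 2-5 using the Ultralytics YOLO and SAM2 video predictor APIs, reusing the `assign_eye` sketch from earlier; frame paths are illustrative, and the real `core/sam2_processor.py` adds scaling, error handling, and debug output on top of this:

```python
from ultralytics import YOLO
from sam2.build_sam import build_sam2_video_predictor

yolo = YOLO("models/yolo/yolov8n.pt")
predictor = build_sam2_video_predictor(
    "models/sam2/configs/sam2.1/sam2.1_hiera_l.yaml",
    "models/sam2/checkpoints/sam2.1_hiera_large.pt",
)

# Step 2: detect humans on the first frame (COCO class 0 is "person").
results = yolo("segment_0/frames/00000.jpg", classes=[0], conf=0.6)
boxes = results[0].boxes.xyxy.cpu().numpy()
frame_width = results[0].orig_shape[1]

# Steps 3-4: prompt SAM2 with one box per eye view.
state = predictor.init_state(video_path="segment_0/frames")
for box in boxes:
    obj_id = assign_eye(box, frame_width)  # 1 = left eye, 2 = right eye
    predictor.add_new_points_or_box(state, frame_idx=0, obj_id=obj_id, box=box)

# Step 5: propagate the prompted masks through every frame in the segment.
for frame_idx, obj_ids, mask_logits in predictor.propagate_in_video(state):
    masks = (mask_logits > 0.0).cpu().numpy()  # one boolean mask per object
```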
### For Subsequent Segments:

1. Check whether YOLO detection is requested for this segment
2. If yes: run YOLO detection (same as the first segment)
3. If no: load the previous segment's final mask
4. Initialize SAM2 with the previous masks (sketched below)
5. Continue propagation through the segment
6. Apply the green screen and save the output
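For the hand-off in steps 3-4, SAM2 accepts mask prompts directly. A sketch reusing the `predictor` from the previous example and assuming one saved mask file per object (the guide only mentions a single `mask.png`, so the on-disk encoding here is an assumption):

```python
import cv2

# Seed the new segment's SAM2 state with the previous segment's final masks
# instead of YOLO prompts (object 1 = left eye, object 2 = right eye).
state = predictor.init_state(video_path="segment_5/frames")
for obj_id in (1, 2):
    # Hypothetical per-object filename; the pipeline's actual layout may differ.
    prev = cv2.imread(f"segment_4/mask_obj{obj_id}.png", cv2.IMREAD_GRAYSCALE)
    predictor.add_new_mask(state, frame_idx=0, obj_id=obj_id, mask=prev > 0)
```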
### Fallback Logic:

- If no previous mask exists, the pipeline searches backwards through earlier segments (see the sketch below)
- The first segment always requires YOLO detection
- Missing detections can be recovered in later segments
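A hypothetical version of that backward search, following the `segment_N/mask.png` layout mentioned under Debug Outputs:

```python
from pathlib import Path

def find_previous_mask(segments_dir: Path, segment_idx: int) -> Path | None:
    """Walk backwards from the previous segment to the most recent saved mask."""
    for i in range(segment_idx - 1, -1, -1):
        candidate = segments_dir / f"segment_{i}" / "mask.png"
        if candidate.exists():
            return candidate
    return None  # nothing found; the caller must fall back to YOLO detection
```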
## Model Support

### YOLO Models

- **Detection**: `yolov8n.pt`, `yolov8s.pt`, `yolov8m.pt` (bounding boxes only)
- **Segmentation**: `yolov8n-seg.pt`, `yolov8s-seg.pt` (direct mask output)

### SAM2 Models

- **Tiny**: `sam2.1_hiera_tiny.pt` (fastest, lowest quality)
- **Small**: `sam2.1_hiera_small.pt`
- **Base+**: `sam2.1_hiera_base_plus.pt`
- **Large**: `sam2.1_hiera_large.pt` (best quality, slowest)
## Key Implementation Details

### GPU Optimization

- CUDA device selection with MPS fallback (see the sketch below)
- CuPy for GPU-accelerated mask operations
- NVENC hardware encoding support
- Batch processing where possible
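The device selection mentioned above typically follows the standard PyTorch pattern (a generic sketch, not necessarily the pipeline's exact code):

```python
import torch

# Prefer CUDA, fall back to Apple's MPS backend, then CPU.
if torch.cuda.is_available():
    device = torch.device("cuda")
elif torch.backends.mps.is_available():
    device = torch.device("mps")
else:
    device = torch.device("cpu")
```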
### Memory Management

- Segments processed sequentially to limit memory usage
- Explicit garbage collection between segments (see the sketch below)
- Low-resolution inference with high-resolution rendering
- Configurable scale factors for different stages
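A sketch of what the between-segment cleanup might involve; the pipeline's actual calls aren't shown in this guide:

```python
import gc
import torch

def release_segment_memory() -> None:
    """Drop dead Python objects and return cached CUDA blocks to the driver."""
    gc.collect()
    if torch.cuda.is_available():
        torch.cuda.empty_cache()
```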
### Error Handling
- Graceful fallback when masks are unavailable
- Segment-level recovery (can restart individual segments)
- Comprehensive logging at all stages
- Status checking and cleanup utilities
## Debugging Features

### Status Monitoring

```bash
python main.py --config config.yaml --status
```

### Segment Cleanup

```bash
python main.py --config config.yaml --cleanup-segment 5
```

### Debug Outputs

- `yolo_debug.jpg`: Bounding box visualizations
- `first_frame_detection.jpg`: Initial mask visualization
- `mask.png`: Final segment mask for continuity
- `yolo_detections`: Saved detection coordinates
## Common Issues and Solutions

### No Right Eye Detections in VR180

- Lower the `yolo_confidence` threshold (try 0.3-0.4)
- Enable debug mode to analyze detection confidence
- Check whether the person is actually visible in the right eye view
### Mask Propagation Failures

- Ensure the first segment has successful YOLO detections
- Check that the previous segment's `mask.png` exists
- Consider re-running YOLO on problem segments
### Memory Issues

- Reduce `inference_scale` (try 0.25)
- Use smaller models (tiny/small variants)
- Process fewer segments at once
## Development Notes

### Adding Features

- All core modules inherit from base classes in `core/`
- Configuration is centralized through `ConfigLoader`
- Logging uses Python's standard `logging` module
- File operations go through `utils/file_utils.py`
### Testing Components

- Each module can be tested independently
- Use the `--status` flag to check processing state
- Debug outputs help verify each stage
### Performance Tuning

- Adjust `inference_scale` for speed vs. quality
- Use `detect_segments` to run YOLO detection only on selected segments
- Enable `use_nvenc` for hardware encoding
- Consider `vos_optimized` mode for SAM2 (experimental)
## Original Monolithic Script

The project includes the original working script in `spec.md` (lines 200-811) as a reference implementation. This script works but processes videos monolithically. The current modular architecture maintains the same core logic while adding:
- Better error handling and recovery
- Configurable processing pipeline
- Debug and monitoring capabilities
- Cleaner code organization