# YOLO + SAM2 VR180 Video Processing Pipeline - LLM Guide

## Project Overview

This repository implements an automated video processing pipeline specifically designed for **VR180 side-by-side stereo videos**. The system detects and segments humans in video content, replacing backgrounds with green screen for post-production compositing. The pipeline is optimized for long VR videos by splitting them into manageable segments, processing each segment independently, and then reassembling the final output.

## Core Purpose

The primary goal is to automatically create green screen videos from VR180 content where:

- **Left eye view** (left half of frame) contains humans as Object 1 (green masks)
- **Right eye view** (right half of frame) contains humans as Object 2 (blue masks)
- Background is replaced with pure green (RGB: 0,255,0) for chroma keying
- Original audio is preserved throughout the process
- Processing handles videos of any length through segmentation

## Architecture Overview

### Pipeline Stages

1. **Video Segmentation** (`core/video_splitter.py`)
   - Splits long videos into 5-second segments using FFmpeg
   - Creates organized directory structure: `segment_0/`, `segment_1/`, etc.
   - Preserves timestamps and forces keyframes for clean cuts

2. **Human Detection** (`core/yolo_detector.py`)
   - Uses YOLOv8 for robust human detection in VR180 format
   - Supports both detection mode (bounding boxes) and segmentation mode (direct masks)
   - Automatically assigns humans to left/right eye based on position in frame
   - Saves detection results for reuse and debugging

3. **Mask Generation** (`core/sam2_processor.py`)
   - Uses Meta's SAM2 (Segment Anything Model 2) for precise segmentation
   - Propagates masks across all frames in each segment
   - Supports mask continuity between segments using previous segment's final masks
   - Handles VR180 stereo tracking with separate object IDs for each eye

4. **Green Screen Processing** (`core/mask_processor.py`)
   - Applies generated masks to isolate humans
   - Replaces background with green screen
   - Uses GPU acceleration (CuPy) for fast processing
   - Maintains original video quality and framerate

5. **Video Assembly** (`core/video_assembler.py`)
   - Concatenates all processed segments into final video
   - Preserves original audio track from input video
   - Uses hardware encoding (NVENC) when available

### Key Components

```
samyolo_on_segments/
├── main.py                 # Entry point - orchestrates the pipeline
├── config.yaml             # Configuration file (YAML format)
├── core/                   # Core processing modules
│   ├── config_loader.py    # Configuration management
│   ├── video_splitter.py   # FFmpeg-based video segmentation
│   ├── yolo_detector.py    # YOLO human detection (detection/segmentation modes)
│   ├── sam2_processor.py   # SAM2 mask generation and propagation
│   ├── mask_processor.py   # Green screen application
│   └── video_assembler.py  # Final video concatenation
├── utils/                  # Utility functions
│   ├── file_utils.py       # File system operations
│   ├── logging_utils.py    # Logging configuration
│   └── status_utils.py     # Progress monitoring
└── models/                 # Model storage (created by download_models.py)
    ├── sam2/               # SAM2 checkpoints and configs
    └── yolo/               # YOLO model weights
```

## VR180 Specific Features

### Stereo Video Handling

- Automatically detects humans in left and right eye views
- Assigns Object ID 1 to left eye humans (green masks)
- Assigns Object ID 2 to right eye humans (blue masks)
- Maintains stereo correspondence throughout segments

### Frame Division Logic

- Frame width is divided in half to separate left/right views
- Human detection centers are used to determine eye assignment
- If only one human is detected, it may be duplicated to both eyes (configurable)
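To make the frame division concrete, here is a minimal sketch of the eye-assignment heuristic described above. It is illustrative only: the helper name and the plain-tuple box format are assumptions, not the actual `core/yolo_detector.py` interface.

```python
# Illustrative sketch of VR180 eye assignment (hypothetical helper, not the
# actual core/yolo_detector.py API): each person box's horizontal center
# decides which half of the side-by-side frame it belongs to, and therefore
# which SAM2 object ID it is tracked under.
from typing import Dict, List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)

LEFT_EYE_OBJECT_ID = 1   # left half of the frame (green masks)
RIGHT_EYE_OBJECT_ID = 2  # right half of the frame (blue masks)


def assign_boxes_to_eyes(person_boxes: List[Box], frame_width: int) -> Dict[int, List[Box]]:
    """Group detected person boxes by eye view using their horizontal centers."""
    half_width = frame_width / 2
    assignments: Dict[int, List[Box]] = {LEFT_EYE_OBJECT_ID: [], RIGHT_EYE_OBJECT_ID: []}
    for x1, y1, x2, y2 in person_boxes:
        center_x = (x1 + x2) / 2
        eye_id = LEFT_EYE_OBJECT_ID if center_x < half_width else RIGHT_EYE_OBJECT_ID
        assignments[eye_id].append((x1, y1, x2, y2))
    return assignments


if __name__ == "__main__":
    # A 5760-pixel-wide side-by-side frame with one person visible in each eye view.
    boxes = [(400.0, 200.0, 900.0, 1800.0), (3400.0, 210.0, 3900.0, 1810.0)]
    print(assign_boxes_to_eyes(boxes, frame_width=5760))
```

The configurable "duplicate to both eyes" behaviour mentioned above would slot in after this grouping, copying a lone detection into whichever eye's list came back empty.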
## Configuration System

The pipeline is controlled via `config.yaml` with these key sections:

### Essential Settings

```yaml
input:
  video_path: "/path/to/vr180_video.mp4"

output:
  directory: "/path/to/output/"
  filename: "greenscreen_output.mp4"

processing:
  segment_duration: 5      # Seconds per segment
  inference_scale: 0.5     # Scale for faster processing
  yolo_confidence: 0.6     # Detection threshold
  detect_segments: "all"   # Which segments to process

models:
  yolo_model: "models/yolo/yolov8n.pt"
  sam2_checkpoint: "models/sam2/checkpoints/sam2.1_hiera_large.pt"
  sam2_config: "models/sam2/configs/sam2.1/sam2.1_hiera_l.yaml"
```

### Advanced Options

- **YOLO Modes**: Switch between detection (bboxes) and segmentation (direct masks)
- **Mid-segment Detection**: Re-detect humans at intervals within segments
- **Mask Quality**: Temporal smoothing, morphological operations, edge refinement
- **Debug Outputs**: Save detection visualizations and first-frame masks

## Processing Flow

### For First Segment (segment_0):

1. Load first frame at inference scale
2. Run YOLO to detect humans
3. Convert detections to SAM2 prompts (or use YOLO masks directly)
4. Initialize SAM2 with prompts/masks
5. Propagate masks through all frames
6. Apply green screen and save output
7. Save final mask for next segment

### For Subsequent Segments:

1. Check if YOLO detection is requested for this segment
2. If yes: Use YOLO detection (same as first segment)
3. If no: Load previous segment's final mask
4. Initialize SAM2 with previous masks
5. Continue propagation through segment
6. Apply green screen and save output

### Fallback Logic:

- If no previous mask exists, searches backwards through segments
- First segment always requires YOLO detection
- Missing detections can be recovered in later segments
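A minimal sketch of this per-segment decision and the backward mask search is shown below. Only the `segment_N/mask.png` layout and the `detect_segments: "all"` setting come from this guide; the function names and signatures are illustrative, not the actual module API.

```python
# Sketch of choosing between fresh YOLO prompts and a previous segment's mask
# (hypothetical helpers; only the segment_N/mask.png layout is from this guide).
from pathlib import Path
from typing import Optional, Sequence, Union


def find_previous_mask(segments_root: Path, segment_index: int) -> Optional[Path]:
    """Search backwards for the most recent segment that saved a final mask."""
    for prior in range(segment_index - 1, -1, -1):
        candidate = segments_root / f"segment_{prior}" / "mask.png"
        if candidate.exists():
            return candidate
    return None


def needs_yolo_detection(
    segment_index: int,
    detect_segments: Union[str, Sequence[int]],
    segments_root: Path,
) -> bool:
    """Decide whether this segment starts from YOLO detections or a prior mask."""
    if segment_index == 0:
        return True  # the first segment always requires YOLO detection
    if detect_segments == "all" or segment_index in detect_segments:
        return True  # detection explicitly requested for this segment
    # Otherwise run YOLO only if no earlier segment left a usable mask behind.
    return find_previous_mask(segments_root, segment_index) is None


if __name__ == "__main__":
    print(needs_yolo_detection(3, detect_segments=[0, 10], segments_root=Path("segments")))
```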
## Model Support

### YOLO Models

- **Detection**: yolov8n.pt, yolov8s.pt, yolov8m.pt (bounding boxes only)
- **Segmentation**: yolov8n-seg.pt, yolov8s-seg.pt (direct mask output)

### SAM2 Models

- **Tiny**: sam2.1_hiera_tiny.pt (fastest, lowest quality)
- **Small**: sam2.1_hiera_small.pt
- **Base+**: sam2.1_hiera_base_plus.pt
- **Large**: sam2.1_hiera_large.pt (best quality, slowest)

## Key Implementation Details

### GPU Optimization

- CUDA device selection with MPS fallback
- CuPy for GPU-accelerated mask operations
- NVENC hardware encoding support
- Batch processing where possible

### Memory Management

- Segments processed sequentially to limit memory usage
- Explicit garbage collection between segments
- Low-resolution inference with high-resolution rendering
- Configurable scale factors for different stages

### Error Handling

- Graceful fallback when masks are unavailable
- Segment-level recovery (can restart individual segments)
- Comprehensive logging at all stages
- Status checking and cleanup utilities

## Debugging Features

### Status Monitoring

```bash
python main.py --config config.yaml --status
```

### Segment Cleanup

```bash
python main.py --config config.yaml --cleanup-segment 5
```

### Debug Outputs

- `yolo_debug.jpg`: Bounding box visualizations
- `first_frame_detection.jpg`: Initial mask visualization
- `mask.png`: Final segment mask for continuity
- `yolo_detections`: Saved detection coordinates

## Common Issues and Solutions

### No Right Eye Detections in VR180

- Lower `yolo_confidence` threshold (try 0.3-0.4)
- Enable debug mode to analyze detection confidence
- Check if the person is actually visible in the right eye view

### Mask Propagation Failures

- Ensure the first segment has successful YOLO detections
- Check that the previous segment's mask.png exists
- Consider re-running YOLO on problem segments

### Memory Issues

- Reduce `inference_scale` (try 0.25)
- Use smaller models (tiny/small variants)
- Process fewer segments at once

## Development Notes

### Adding Features

- All core modules inherit from base classes in `core/`
- Configuration is centralized through `ConfigLoader`
- Logging uses Python's standard logging module
- File operations go through `utils/file_utils.py`

### Testing Components

- Each module can be tested independently
- Use the `--status` flag to check processing state
- Debug outputs help verify each stage

### Performance Tuning

- Adjust `inference_scale` for speed vs quality
- Use `detect_segments` to process only key frames
- Enable `use_nvenc` for hardware encoding
- Consider `vos_optimized` mode for SAM2 (experimental)

## Original Monolithic Script

The project includes the original working script in `spec.md` (lines 200-811) as a reference implementation. This script works but processes videos monolithically. The current modular architecture maintains the same core logic while adding:

- Better error handling and recovery
- Configurable processing pipeline
- Debug and monitoring capabilities
- Cleaner code organization
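For reference, the core compositing operation performed in the Green Screen Processing stage can be sketched as follows. This uses NumPy for clarity; the pipeline itself performs this step with CuPy on the GPU, and the function shown here is illustrative rather than the actual `core/mask_processor.py` implementation.

```python
# Minimal sketch of green-screen compositing: keep human pixels identified by
# the mask and replace everything else with pure green (0, 255, 0). NumPy is
# used for illustration; the pipeline uses CuPy for GPU acceleration.
import numpy as np


def apply_green_screen(frame_rgb: np.ndarray, human_mask: np.ndarray) -> np.ndarray:
    """Composite an RGB frame over a chroma-key green background using a boolean mask."""
    output = np.empty_like(frame_rgb)
    output[:] = (0, 255, 0)                      # pure green background
    output[human_mask] = frame_rgb[human_mask]   # copy human pixels through
    return output


if __name__ == "__main__":
    frame = np.random.randint(0, 256, (1080, 1920, 3), dtype=np.uint8)
    mask = np.zeros((1080, 1920), dtype=bool)
    mask[200:900, 300:800] = True                # pretend this region is a person
    print(apply_green_screen(frame, mask).shape)
```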