
YOLO + SAM2 VR180 Video Processing Pipeline - LLM Guide

Project Overview

This repository implements an automated video processing pipeline specifically designed for VR180 side-by-side stereo videos. The system detects and segments humans in video content, replacing backgrounds with green screen for post-production compositing. The pipeline is optimized for long VR videos by splitting them into manageable segments, processing each segment independently, and then reassembling the final output.

Core Purpose

The primary goal is to automatically create green screen videos from VR180 content where:

  • Left eye view (left half of frame) contains humans as Object 1 (green masks)
  • Right eye view (right half of frame) contains humans as Object 2 (blue masks)
  • Background is replaced with pure green (RGB: 0,255,0) for chroma keying
  • Original audio is preserved throughout the process
  • Processing handles videos of any length through segmentation

Architecture Overview

Pipeline Stages

  1. Video Segmentation (core/video_splitter.py)

    • Splits long videos into 5-second segments using FFmpeg
    • Creates organized directory structure: segment_0/, segment_1/, etc.
    • Preserves timestamps and forces keyframes for clean cuts (see the FFmpeg sketch after this list)
  2. Human Detection (core/yolo_detector.py)

    • Uses YOLOv8 for robust human detection in VR180 format
    • Supports both detection mode (bounding boxes) and segmentation mode (direct masks)
    • Automatically assigns humans to left/right eye based on position in frame
    • Saves detection results for reuse and debugging
  3. Mask Generation (core/sam2_processor.py)

    • Uses Meta's SAM2 (Segment Anything Model 2) for precise segmentation
    • Propagates masks across all frames in each segment
    • Supports mask continuity between segments using previous segment's final masks
    • Handles VR180 stereo tracking with separate object IDs for each eye
  4. Green Screen Processing (core/mask_processor.py)

    • Applies generated masks to isolate humans
    • Replaces background with green screen
    • Uses GPU acceleration (CuPy) for fast processing
    • Maintains original video quality and framerate
  5. Video Assembly (core/video_assembler.py)

    • Concatenates all processed segments into final video
    • Preserves original audio track from input video
    • Uses hardware encoding (NVENC) when available
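
As referenced in stage 1, the split step boils down to one FFmpeg invocation per input video. A minimal sketch, assuming core/video_splitter.py shells out roughly like this (flags, function name, and output layout are illustrative; the actual module may differ):

import subprocess

def split_video(input_path: str, out_pattern: str, seg_seconds: int = 5) -> None:
    """Cut the input into fixed-length segments, forcing keyframes at the cuts."""
    subprocess.run([
        "ffmpeg", "-i", input_path,
        "-c:v", "libx264",
        # Force a keyframe at every segment boundary so each cut is frame-accurate.
        "-force_key_frames", f"expr:gte(t,n_forced*{seg_seconds})",
        "-f", "segment", "-segment_time", str(seg_seconds),
        "-reset_timestamps", "0",  # preserve original timestamps
        "-an",                     # audio is re-attached during final assembly
        out_pattern,               # e.g. "segments/segment_%d.mp4"
    ], check=True)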

Key Components

samyolo_on_segments/
├── main.py                 # Entry point - orchestrates the pipeline
├── config.yaml            # Configuration file (YAML format)
├── core/                  # Core processing modules
│   ├── config_loader.py   # Configuration management
│   ├── video_splitter.py  # FFmpeg-based video segmentation
│   ├── yolo_detector.py   # YOLO human detection (detection/segmentation modes)
│   ├── sam2_processor.py  # SAM2 mask generation and propagation
│   ├── mask_processor.py  # Green screen application
│   └── video_assembler.py # Final video concatenation
├── utils/                 # Utility functions
│   ├── file_utils.py      # File system operations
│   ├── logging_utils.py   # Logging configuration
│   └── status_utils.py    # Progress monitoring
└── models/                # Model storage (created by download_models.py)
    ├── sam2/             # SAM2 checkpoints and configs
    └── yolo/             # YOLO model weights

VR180 Specific Features

Stereo Video Handling

  • Automatically detects humans in left and right eye views
  • Assigns Object ID 1 to left eye humans (green masks)
  • Assigns Object ID 2 to right eye humans (blue masks)
  • Maintains stereo correspondence throughout segments

Frame Division Logic

  • Frame width is divided in half to separate left/right views
  • Human detection centers are used to determine eye assignment
  • If only one human is detected, it may be duplicated to both eyes (configurable)
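
The assignment rule itself reduces to a comparison against the frame midpoint. An illustrative sketch (the function name is hypothetical, not the project's API):

def assign_object_id(center_x: float, frame_width: int) -> int:
    """Map a detection to a SAM2 object ID by which half of the frame it sits in.

    Object 1 = left-eye human (green mask), Object 2 = right-eye human (blue mask).
    """
    return 1 if center_x < frame_width / 2 else 2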

Configuration System

The pipeline is controlled via config.yaml with these key sections:

Essential Settings

input:
  video_path: "/path/to/vr180_video.mp4"

output:
  directory: "/path/to/output/"
  filename: "greenscreen_output.mp4"

processing:
  segment_duration: 5              # Seconds per segment
  inference_scale: 0.5            # Scale for faster processing
  yolo_confidence: 0.6            # Detection threshold
  detect_segments: "all"          # Which segments run YOLO detection

models:
  yolo_model: "models/yolo/yolov8n.pt"
  sam2_checkpoint: "models/sam2/checkpoints/sam2.1_hiera_large.pt"
  sam2_config: "models/sam2/configs/sam2.1/sam2.1_hiera_l.yaml"
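
A minimal sketch of reading such a file with PyYAML (the project's ConfigLoader presumably layers validation and defaults on top of this):

import yaml  # PyYAML

def load_config(path: str = "config.yaml") -> dict:
    """Read the pipeline configuration into a nested dict."""
    with open(path, "r") as f:
        return yaml.safe_load(f)

cfg = load_config()
print(cfg["processing"]["segment_duration"])  # -> 5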

Advanced Options

  • YOLO Modes: Switch between detection (bboxes) and segmentation (direct masks)
  • Mid-segment Detection: Re-detect humans at intervals within segments
  • Mask Quality: Temporal smoothing, morphological operations, edge refinement
  • Debug Outputs: Save detection visualizations and first-frame masks
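
A hypothetical config.yaml fragment showing where such options could live; every key name below is illustrative, not the project's actual schema:

processing:
  yolo_mode: "segmentation"     # hypothetical: "detection" or "segmentation"
  mid_segment_detect_every: 30  # hypothetical: re-run YOLO every N frames
  temporal_smoothing: true      # hypothetical mask-quality toggle
  save_debug_images: true       # hypothetical debug-output toggle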

Processing Flow

For First Segment (segment_0):

  1. Load first frame at inference scale
  2. Run YOLO to detect humans
  3. Convert detections to SAM2 prompts (or use YOLO masks directly)
  4. Initialize SAM2 with prompts/masks
  5. Propagate masks through all frames
  6. Apply green screen and save output
  7. Save final mask for next segment

For Subsequent Segments:

  1. Check if YOLO detection is requested for this segment
  2. If yes: Use YOLO detection (same as first segment)
  3. If no: Load previous segment's final mask
  4. Initialize SAM2 with previous masks
  5. Continue propagation through segment
  6. Apply green screen and save output

Fallback Logic:

  • If no previous mask exists, the pipeline searches backwards through earlier segments for the most recent saved mask
  • First segment always requires YOLO detection
  • Missing detections can be recovered in later segments
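
The backwards search is the only subtle part. A self-contained sketch, assuming the segment_N/mask.png layout described under Debug Outputs (names are illustrative):

import os
from typing import Optional

def find_previous_mask(segments_dir: str, idx: int) -> Optional[str]:
    """Walk backwards from segment idx-1 until a saved mask.png is found."""
    for j in range(idx - 1, -1, -1):
        candidate = os.path.join(segments_dir, f"segment_{j}", "mask.png")
        if os.path.exists(candidate):
            return candidate
    return None  # segment 0 reached with no mask: caller must re-run YOLO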

Model Support

YOLO Models

  • Detection: yolov8n.pt, yolov8s.pt, yolov8m.pt (bounding boxes only)
  • Segmentation: yolov8n-seg.pt, yolov8s-seg.pt (direct mask output)

SAM2 Models

  • Tiny: sam2.1_hiera_tiny.pt (fastest, lowest quality)
  • Small: sam2.1_hiera_small.pt
  • Base+: sam2.1_hiera_base_plus.pt
  • Large: sam2.1_hiera_large.pt (best quality, slowest)
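
Model size is switched by changing checkpoint and config together in config.yaml. For example (the config filename follows the SAM2 repository's naming convention; verify it against your models/ directory):

models:
  sam2_checkpoint: "models/sam2/checkpoints/sam2.1_hiera_tiny.pt"
  sam2_config: "models/sam2/configs/sam2.1/sam2.1_hiera_t.yaml"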

Key Implementation Details

GPU Optimization

  • CUDA device selection with MPS fallback
  • CuPy for GPU-accelerated mask operations
  • NVENC hardware encoding support
  • Batch processing where possible
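
Of these, the CuPy mask operation is the heart of stage 4. A minimal sketch, assuming 8-bit BGR frames and a binary human mask (function and argument names are illustrative):

import cupy as cp
import numpy as np

def composite_green_screen(frame_bgr: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Keep human pixels where the mask is set; paint everything else pure green."""
    f = cp.asarray(frame_bgr)                        # upload frame to GPU
    m = cp.asarray(mask, dtype=cp.bool_)[..., None]  # HxW -> HxWx1 for broadcasting
    green = cp.zeros_like(f)
    green[..., 1] = 255                              # BGR pure green = (0, 255, 0)
    return cp.asnumpy(cp.where(m, f, green))         # per-pixel select, then download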

Memory Management

  • Segments processed sequentially to limit memory usage
  • Explicit garbage collection between segments
  • Low-resolution inference with high-resolution rendering
  • Configurable scale factors for different stages
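
Between segments the cleanup is explicit. A sketch of the standard PyTorch idiom:

import gc
import torch

def release_segment_memory() -> None:
    """Drop dead references, then return cached GPU blocks to the allocator."""
    gc.collect()
    if torch.cuda.is_available():
        torch.cuda.empty_cache()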

Error Handling

  • Graceful fallback when masks are unavailable
  • Segment-level recovery (can restart individual segments)
  • Comprehensive logging at all stages
  • Status checking and cleanup utilities

Debugging Features

Status Monitoring

python main.py --config config.yaml --status

Segment Cleanup

python main.py --config config.yaml --cleanup-segment 5

Debug Outputs

  • yolo_debug.jpg: Bounding box visualizations
  • first_frame_detection.jpg: Initial mask visualization
  • mask.png: Final segment mask for continuity
  • yolo_detections: Saved detection coordinates

Common Issues and Solutions

No Right Eye Detections in VR180

  • Lower the yolo_confidence threshold (try 0.3-0.4)
  • Enable debug mode to analyze detection confidence
  • Check whether the person is actually visible in the right-eye view
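
For example, in config.yaml:

processing:
  yolo_confidence: 0.35   # lowered from 0.6 to catch fainter right-eye detections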

Mask Propagation Failures

  • Ensure first segment has successful YOLO detections
  • Check previous segment's mask.png exists
  • Consider re-running YOLO on problem segments

Memory Issues

  • Reduce inference_scale (try 0.25)
  • Use smaller models (tiny/small variants)
  • Process fewer segments at once
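
If inference_scale is a per-axis factor (as scale factors usually are), 0.25 means each inference frame carries roughly 1/16th of the original pixels. For example:

processing:
  inference_scale: 0.25   # 0.25x per axis, i.e. ~1/16th of the pixels per frame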

Development Notes

Adding Features

  • All core modules inherit from base classes in core/
  • Configuration is centralized through ConfigLoader
  • Logging uses Python's standard logging module
  • File operations go through utils/file_utils.py

Testing Components

  • Each module can be tested independently
  • Use --status flag to check processing state
  • Debug outputs help verify each stage

Performance Tuning

  • Adjust inference_scale for speed vs quality
  • Use detect_segments to limit YOLO detection to selected segments
  • Enable use_nvenc for hardware encoding
  • Consider vos_optimized mode for SAM2 (experimental)
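
A hypothetical tuning block pulling these together (the use_nvenc and vos_optimized key placements are illustrative, not the project's confirmed schema):

processing:
  inference_scale: 0.5      # lower = faster, coarser masks
  detect_segments: "all"    # or a subset, to limit YOLO passes
  use_nvenc: true           # hypothetical placement: hardware encoding
  vos_optimized: true       # hypothetical placement: experimental SAM2 mode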

Original Monolithic Script

The project includes the original working script in spec.md (lines 200-811) as a reference implementation. This script works but processes videos monolithically. The current modular architecture maintains the same core logic while adding:

  • Better error handling and recovery
  • Configurable processing pipeline
  • Debug and monitoring capabilities
  • Cleaner code organization