
YOLO + SAM2 VR180 Video Processing Pipeline - LLM Guide

Project Overview

This repository implements an automated video processing pipeline specifically designed for VR180 side-by-side stereo videos. The system detects and segments humans in video content, replacing backgrounds with green screen for post-production compositing. The pipeline is optimized for long VR videos by splitting them into manageable segments, processing each segment independently, and then reassembling the final output.

Core Purpose

The primary goal is to automatically create green screen videos from VR180 content where:

  • Left eye view (left half of frame) contains humans as Object 1 (green masks)
  • Right eye view (right half of frame) contains humans as Object 2 (blue masks)
  • Background is replaced with pure green (RGB: 0,255,0) for chroma keying
  • Original audio is preserved throughout the process
  • Processing handles videos of any length through segmentation

Architecture Overview

Pipeline Stages

  1. Video Segmentation (core/video_splitter.py)

    • Splits long videos into 5-second segments using FFmpeg
    • Creates organized directory structure: segment_0/, segment_1/, etc.
    • Preserves timestamps and forces keyframes for clean cuts (see the FFmpeg sketch after this list)
  2. Human Detection (core/yolo_detector.py)

    • Uses YOLOv8 for robust human detection in VR180 format
    • Supports both detection mode (bounding boxes) and segmentation mode (direct masks)
    • Automatically assigns humans to left/right eye based on position in frame
    • Saves detection results for reuse and debugging
  3. Mask Generation (core/sam2_processor.py)

    • Uses Meta's SAM2 (Segment Anything Model 2) for precise segmentation
    • Propagates masks across all frames in each segment
    • Supports mask continuity between segments using previous segment's final masks
    • Handles VR180 stereo tracking with separate object IDs for each eye
  4. Green Screen Processing (core/mask_processor.py)

    • Applies generated masks to isolate humans
    • Replaces background with green screen
    • Uses GPU acceleration (CuPy) for fast processing
    • Maintains original video quality and framerate
  5. Video Assembly (core/video_assembler.py)

    • Concatenates all processed segments into final video
    • Preserves original audio track from input video
    • Uses hardware encoding (NVENC) when available
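
As referenced in stage 1, the split step boils down to one FFmpeg invocation per input video. A minimal sketch, assuming core/video_splitter.py shells out roughly like this (flags, function name, and output layout are illustrative; the actual module may differ):

import subprocess

def split_video(input_path: str, out_pattern: str, seg_seconds: int = 5) -> None:
    """Cut the input into fixed-length segments, forcing keyframes at the cuts."""
    subprocess.run([
        "ffmpeg", "-i", input_path,
        "-c:v", "libx264",
        # Force a keyframe at every segment boundary so each cut is frame-accurate.
        "-force_key_frames", f"expr:gte(t,n_forced*{seg_seconds})",
        "-f", "segment", "-segment_time", str(seg_seconds),
        "-reset_timestamps", "0",  # preserve original timestamps
        "-an",                     # audio is re-attached during final assembly
        out_pattern,               # e.g. "segments/segment_%d.mp4"
    ], check=True)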

Key Components

samyolo_on_segments/
├── main.py                 # Entry point - orchestrates the pipeline
├── config.yaml            # Configuration file (YAML format)
├── core/                  # Core processing modules
│   ├── config_loader.py   # Configuration management
│   ├── video_splitter.py  # FFmpeg-based video segmentation
│   ├── yolo_detector.py   # YOLO human detection (detection/segmentation modes)
│   ├── sam2_processor.py  # SAM2 mask generation and propagation
│   ├── mask_processor.py  # Green screen application
│   └── video_assembler.py # Final video concatenation
├── utils/                 # Utility functions
│   ├── file_utils.py      # File system operations
│   ├── logging_utils.py   # Logging configuration
│   └── status_utils.py    # Progress monitoring
└── models/                # Model storage (created by download_models.py)
    ├── sam2/             # SAM2 checkpoints and configs
    └── yolo/             # YOLO model weights

VR180 Specific Features

Stereo Video Handling

  • Automatically detects humans in left and right eye views
  • Assigns Object ID 1 to left eye humans (green masks)
  • Assigns Object ID 2 to right eye humans (blue masks)
  • Maintains stereo correspondence throughout segments

Frame Division Logic

  • Frame width is divided in half to separate left/right views
  • Human detection centers are used to determine eye assignment
  • If only one human is detected, it may be duplicated to both eyes (configurable)
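
The assignment rule itself reduces to a comparison against the frame midpoint. An illustrative sketch (the function name is hypothetical, not the project's API):

def assign_object_id(center_x: float, frame_width: int) -> int:
    """Map a detection to a SAM2 object ID by which half of the frame it sits in.

    Object 1 = left-eye human (green mask), Object 2 = right-eye human (blue mask).
    """
    return 1 if center_x < frame_width / 2 else 2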

Configuration System

The pipeline is controlled via config.yaml with these key sections:

Essential Settings

input:
  video_path: "/path/to/vr180_video.mp4"

output:
  directory: "/path/to/output/"
  filename: "greenscreen_output.mp4"

processing:
  segment_duration: 5              # Seconds per segment
  inference_scale: 0.5            # Scale for faster processing
  yolo_confidence: 0.6            # Detection threshold
  detect_segments: "all"          # Which segments run YOLO detection

models:
  yolo_model: "models/yolo/yolov8n.pt"
  sam2_checkpoint: "models/sam2/checkpoints/sam2.1_hiera_large.pt"
  sam2_config: "models/sam2/configs/sam2.1/sam2.1_hiera_l.yaml"
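
A minimal sketch of reading such a file with PyYAML (the project's ConfigLoader presumably layers validation and defaults on top of this):

import yaml  # PyYAML

def load_config(path: str = "config.yaml") -> dict:
    """Read the pipeline configuration into a nested dict."""
    with open(path, "r") as f:
        return yaml.safe_load(f)

cfg = load_config()
print(cfg["processing"]["segment_duration"])  # -> 5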

Advanced Options

  • YOLO Modes: Switch between detection (bboxes) and segmentation (direct masks)
  • Mid-segment Detection: Re-detect humans at intervals within segments
  • Mask Quality: Temporal smoothing, morphological operations, edge refinement
  • Debug Outputs: Save detection visualizations and first-frame masks
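
A hypothetical config.yaml fragment showing where such options could live; every key name below is illustrative, not the project's actual schema:

processing:
  yolo_mode: "segmentation"     # hypothetical: "detection" or "segmentation"
  mid_segment_detect_every: 30  # hypothetical: re-run YOLO every N frames
  temporal_smoothing: true      # hypothetical mask-quality toggle
  save_debug_images: true       # hypothetical debug-output toggle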

Processing Flow

For First Segment (segment_0):

  1. Load first frame at inference scale
  2. Run YOLO to detect humans
  3. Convert detections to SAM2 prompts (or use YOLO masks directly)
  4. Initialize SAM2 with prompts/masks
  5. Propagate masks through all frames
  6. Apply green screen and save output
  7. Save final mask for next segment

For Subsequent Segments:

  1. Check if YOLO detection is requested for this segment
  2. If yes: Use YOLO detection (same as first segment)
  3. If no: Load previous segment's final mask
  4. Initialize SAM2 with previous masks
  5. Continue propagation through segment
  6. Apply green screen and save output

Fallback Logic:

  • If no previous mask exists, the pipeline searches backwards through earlier segments for the most recent saved mask
  • First segment always requires YOLO detection
  • Missing detections can be recovered in later segments
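
The backwards search is the only subtle part. A self-contained sketch, assuming the segment_N/mask.png layout described under Debug Outputs (names are illustrative):

import os
from typing import Optional

def find_previous_mask(segments_dir: str, idx: int) -> Optional[str]:
    """Walk backwards from segment idx-1 until a saved mask.png is found."""
    for j in range(idx - 1, -1, -1):
        candidate = os.path.join(segments_dir, f"segment_{j}", "mask.png")
        if os.path.exists(candidate):
            return candidate
    return None  # segment 0 reached with no mask: caller must re-run YOLO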

Model Support

YOLO Models

  • Detection: yolov8n.pt, yolov8s.pt, yolov8m.pt (bounding boxes only)
  • Segmentation: yolov8n-seg.pt, yolov8s-seg.pt (direct mask output)

SAM2 Models

  • Tiny: sam2.1_hiera_tiny.pt (fastest, lowest quality)
  • Small: sam2.1_hiera_small.pt
  • Base+: sam2.1_hiera_base_plus.pt
  • Large: sam2.1_hiera_large.pt (best quality, slowest)
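
Model size is switched by changing checkpoint and config together in config.yaml. For example (the config filename follows the SAM2 repository's naming convention; verify it against your models/ directory):

models:
  sam2_checkpoint: "models/sam2/checkpoints/sam2.1_hiera_tiny.pt"
  sam2_config: "models/sam2/configs/sam2.1/sam2.1_hiera_t.yaml"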

Key Implementation Details

GPU Optimization

  • CUDA device selection with MPS fallback
  • CuPy for GPU-accelerated mask operations
  • NVENC hardware encoding support
  • Batch processing where possible
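
Of these, the CuPy mask operation is the heart of stage 4. A minimal sketch, assuming 8-bit BGR frames and a binary human mask (function and argument names are illustrative):

import cupy as cp
import numpy as np

def composite_green_screen(frame_bgr: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Keep human pixels where the mask is set; paint everything else pure green."""
    f = cp.asarray(frame_bgr)                        # upload frame to GPU
    m = cp.asarray(mask, dtype=cp.bool_)[..., None]  # HxW -> HxWx1 for broadcasting
    green = cp.zeros_like(f)
    green[..., 1] = 255                              # BGR pure green = (0, 255, 0)
    return cp.asnumpy(cp.where(m, f, green))         # per-pixel select, then download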

Memory Management

  • Segments processed sequentially to limit memory usage
  • Explicit garbage collection between segments
  • Low-resolution inference with high-resolution rendering
  • Configurable scale factors for different stages
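
Between segments the cleanup is explicit. A sketch of the standard PyTorch idiom:

import gc
import torch

def release_segment_memory() -> None:
    """Drop dead references, then return cached GPU blocks to the allocator."""
    gc.collect()
    if torch.cuda.is_available():
        torch.cuda.empty_cache()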

Error Handling

  • Graceful fallback when masks are unavailable
  • Segment-level recovery (can restart individual segments)
  • Comprehensive logging at all stages
  • Status checking and cleanup utilities

Debugging Features

Status Monitoring

python main.py --config config.yaml --status

Segment Cleanup

python main.py --config config.yaml --cleanup-segment 5

Debug Outputs

  • yolo_debug.jpg: Bounding box visualizations
  • first_frame_detection.jpg: Initial mask visualization
  • mask.png: Final segment mask for continuity
  • yolo_detections: Saved detection coordinates

Common Issues and Solutions

No Right Eye Detections in VR180

  • Lower the yolo_confidence threshold (try 0.3-0.4)
  • Enable debug mode to analyze detection confidence
  • Check whether the person is actually visible in the right-eye view
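
For example, in config.yaml:

processing:
  yolo_confidence: 0.35   # lowered from 0.6 to catch fainter right-eye detections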

Mask Propagation Failures

  • Ensure first segment has successful YOLO detections
  • Check previous segment's mask.png exists
  • Consider re-running YOLO on problem segments

Memory Issues

  • Reduce inference_scale (try 0.25)
  • Use smaller models (tiny/small variants)
  • Process fewer segments at once
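
If inference_scale is a per-axis factor (as scale factors usually are), 0.25 means each inference frame carries roughly 1/16th of the original pixels. For example:

processing:
  inference_scale: 0.25   # 0.25x per axis, i.e. ~1/16th of the pixels per frame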

Development Notes

Adding Features

  • All core modules inherit from base classes in core/
  • Configuration is centralized through ConfigLoader
  • Logging uses Python's standard logging module
  • File operations go through utils/file_utils.py

Testing Components

  • Each module can be tested independently
  • Use --status flag to check processing state
  • Debug outputs help verify each stage

Performance Tuning

  • Adjust inference_scale for speed vs quality
  • Use detect_segments to limit YOLO detection to selected segments
  • Enable use_nvenc for hardware encoding
  • Consider vos_optimized mode for SAM2 (experimental)
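
A hypothetical tuning block pulling these together (the use_nvenc and vos_optimized key placements are illustrative, not the project's confirmed schema):

processing:
  inference_scale: 0.5      # lower = faster, coarser masks
  detect_segments: "all"    # or a subset, to limit YOLO passes
  use_nvenc: true           # hypothetical placement: hardware encoding
  vos_optimized: true       # hypothetical placement: experimental SAM2 mode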

Original Monolithic Script

The project includes the original working script in spec.md (lines 200-811) as a reference implementation. This script works but processes videos monolithically. The current modular architecture maintains the same core logic while adding:

  • Better error handling and recovery
  • Configurable processing pipeline
  • Debug and monitoring capabilities
  • Cleaner code organization