YOLO + SAM2 Video Processing Pipeline

An automated video processing system that combines YOLO object detection with Meta's SAM2 (Segment Anything Model 2) to create green screen videos with precise human segmentation.

Overview

This pipeline processes long videos by splitting them into manageable segments, detecting humans using YOLO, and generating precise masks with SAM2 for green screen background replacement. The system preserves audio and maintains video quality throughout the process.

Features

  • Automated Human Detection: Uses YOLOv8 for robust human detection
  • Precise Segmentation: Leverages SAM2 for accurate mask generation
  • Scalable Processing: Handles videos of any length through segmentation
  • GPU Acceleration: CUDA/NVENC support for faster processing
  • Audio Preservation: Maintains original audio track in output
  • Stereo Video Support: Handles VR/360 content with left/right tracking
  • Configurable Pipeline: YAML-based configuration for easy customization

Installation

Prerequisites

  • Python 3.8+
  • NVIDIA GPU with CUDA support (recommended)
  • FFmpeg installed and available in PATH

Install Dependencies

# Clone the repository
git clone <repository-url>
cd samyolo_on_segments

# Install Python dependencies
uv venv && source .venv/bin/activate
uv pip install -r requirements.txt

Download Models

Use the provided script to automatically download all required models:

# Download SAM2.1 and YOLO models
python download_models.py

This script will:

  • Create a models/ directory structure
  • Download SAM2.1 configs and checkpoints (tiny, small, base+, large)
  • Download common YOLO models (yolov8n, yolov8s, yolov8m)
  • Update config.yaml to use local model paths

Manual Download (Alternative):

  1. SAM2 Models: Download from Meta's SAM2 repository
  2. YOLO Models: YOLOv8 models will be downloaded automatically on first use

Quick Start

1. Download Models

First, download the required SAM2.1 and YOLO models:

python download_models.py

2. Configure the Pipeline

Edit config.yaml to specify your input video and desired settings:

input:
  video_path: "/path/to/your/video.mp4"
  
output:
  directory: "/path/to/output/"
  filename: "processed_video.mp4"
  
processing:
  segment_duration: 5
  inference_scale: 0.5
  yolo_confidence: 0.6
  detect_segments: "all"
  
models:
  yolo_model: "models/yolo/yolov8n.pt"
  sam2_checkpoint: "models/sam2/checkpoints/sam2.1_hiera_large.pt"
  sam2_config: "models/sam2/configs/sam2.1/sam2.1_hiera_l.yaml"

3. Run the Pipeline

python main.py --config config.yaml

4. Monitor Progress

Check processing status:

python main.py --config config.yaml --status

Clean up a specific segment for reprocessing:

python main.py --config config.yaml --cleanup-segment 5

Configuration Options

Input/Output Settings

Parameter         Description               Default
----------------  ------------------------  --------
input.video_path  Path to input video file  Required
output.directory  Output directory path     Required
output.filename   Output video filename     Required

Processing Parameters

Parameter                    Description                          Default
---------------------------  -----------------------------------  -------
processing.segment_duration  Duration of each segment (seconds)   5
processing.inference_scale   Scale factor for SAM2 inference      0.5
processing.yolo_confidence   YOLO detection confidence threshold  0.6
processing.detect_segments   Segments to process ("all" or list)  "all"
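
The inference_scale parameter trades mask fidelity for speed by downscaling frames before SAM2 inference and upscaling the resulting masks back to full resolution, roughly as in this sketch (run_sam2 is a hypothetical callable standing in for the segmentation step):

import cv2

def scaled_inference(frame, run_sam2, scale=0.5):
    """Run mask inference at reduced resolution, return a full-size mask."""
    h, w = frame.shape[:2]
    small = cv2.resize(frame, (int(w * scale), int(h * scale)))
    mask = run_sam2(small)  # hypothetical: returns a uint8 mask for the small frame
    return cv2.resize(mask, (w, h), interpolation=cv2.INTER_NEAREST)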

Model Configuration

Parameter               Description              Default
----------------------  -----------------------  ------------
models.yolo_model       YOLO model path or name  "yolov8n.pt"
models.sam2_checkpoint  SAM2 checkpoint path     Required
models.sam2_config      SAM2 config file path    Required

Video Settings

Parameter              Description                     Default
---------------------  ------------------------------  -------
video.use_nvenc        Use NVIDIA hardware encoding    true
video.output_bitrate   Output video bitrate            "50M"
video.preserve_audio   Copy original audio track       true
video.force_keyframes  Force keyframes for clean cuts  true

Advanced Options

Parameter                            Description                     Default
-----------------------------------  ------------------------------  -----------
advanced.green_color                 Green screen color (BGR order)  [0, 255, 0]
advanced.blue_color                  Blue screen color (BGR order)   [255, 0, 0]
advanced.human_class_id              YOLO class ID for "person"      0
advanced.log_level                   Logging verbosity               "INFO"
advanced.cleanup_intermediate_files  Clean up temporary files        true

Processing Pipeline

Step 1: Video Segmentation

  • Splits input video into configurable segments (default 5 seconds)
  • Creates organized directory structure: video_segments/segment_0/, segment_1/, etc.
  • Uses FFmpeg with keyframe forcing for clean cuts
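A minimal sketch of the FFmpeg invocation behind this step (the real logic lives in core/video_splitter.py; the output naming here is illustrative):

import subprocess

def split_video(video_path, out_dir, segment_duration=5):
    """Split a video into fixed-length segments with keyframes forced at cut points."""
    subprocess.run([
        "ffmpeg", "-i", video_path,
        # Place a keyframe exactly at every segment boundary so cuts are clean
        "-force_key_frames", f"expr:gte(t,n_forced*{segment_duration})",
        "-f", "segment",
        "-segment_time", str(segment_duration),
        "-reset_timestamps", "1",
        f"{out_dir}/segment_%04d.mp4",
    ], check=True)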

Step 2: Human Detection

  • Runs YOLO detection on specified segments
  • Detects human bounding boxes with configurable confidence threshold
  • Saves detection results for reuse and debugging
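Conceptually, the detection step looks like the sketch below (using the ultralytics API; the real implementation lives in core/yolo_detector.py, and the segment path is illustrative):

from ultralytics import YOLO

model = YOLO("models/yolo/yolov8n.pt")

# Class 0 is "person" in the COCO label set YOLOv8 ships with
results = model("video_segments/segment_0/segment_0.mp4",
                conf=0.6, classes=[0], stream=True)

for frame_idx, result in enumerate(results):
    boxes = result.boxes.xyxy.cpu().numpy()  # (N, 4) array of [x1, y1, x2, y2]
    print(f"frame {frame_idx}: {len(boxes)} person(s)")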

Step 3: SAM2 Segmentation (In Development)

  • Uses YOLO detections as prompts for SAM2
  • Generates precise masks for detected humans
  • Propagates masks across all frames in segments
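Since this step is still in development, the following is only a rough sketch of how the public sam2 video predictor API can be driven with a YOLO box as the prompt (paths follow config.yaml; the frames directory and example box are illustrative):

import torch
from sam2.build_sam import build_sam2_video_predictor

predictor = build_sam2_video_predictor(
    "models/sam2/configs/sam2.1/sam2.1_hiera_l.yaml",
    "models/sam2/checkpoints/sam2.1_hiera_large.pt",
)

yolo_box = [400, 100, 880, 980]  # example [x1, y1, x2, y2] from Step 2

with torch.inference_mode():
    # init_state expects a directory of extracted JPEG frames
    state = predictor.init_state(video_path="video_segments/segment_0/frames")

    # Seed SAM2 with the YOLO bounding box on the first frame
    predictor.add_new_points_or_box(
        inference_state=state, frame_idx=0, obj_id=1, box=yolo_box,
    )

    # Propagate the mask through the remaining frames of the segment
    for frame_idx, obj_ids, mask_logits in predictor.propagate_in_video(state):
        masks = (mask_logits > 0.0).cpu().numpy()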

Step 4: Green Screen Processing (In Development)

  • Applies generated masks to isolate humans
  • Replaces background with green screen
  • Maintains video quality and framerate
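The compositing itself is straightforward once a per-frame mask exists; a minimal NumPy sketch, assuming frames are handled in OpenCV's BGR order:

import numpy as np

GREEN_BGR = (0, 255, 0)  # matches advanced.green_color

def apply_green_screen(frame, mask):
    """frame: HxWx3 BGR uint8 image; mask: HxW boolean array from SAM2."""
    out = np.full_like(frame, GREEN_BGR)  # start from a solid green frame
    out[mask] = frame[mask]               # copy the human pixels back in
    return out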

Step 5: Video Assembly (In Development)

  • Concatenates processed segments
  • Preserves original audio track
  • Outputs final video with green screen background
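The planned assembly can be done with FFmpeg's concat demuxer, pulling the audio back from the original file (a sketch; file names are illustrative):

import subprocess

def assemble(segment_list_txt, original_video, output_path):
    """Concatenate processed segments and mux in the original audio track."""
    subprocess.run([
        "ffmpeg",
        # segment_list_txt contains one "file 'segment_N.mp4'" line per segment
        "-f", "concat", "-safe", "0", "-i", segment_list_txt,
        "-i", original_video,
        "-map", "0:v", "-map", "1:a",  # video from segments, audio from the original
        "-c", "copy",
        output_path,
    ], check=True)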

Project Structure

samyolo_on_segments/
├── README.md              # This documentation
├── config.yaml           # Default configuration
├── main.py               # Main entry point
├── download_models.py    # Model download script
├── requirements.txt      # Python dependencies
├── spec.md              # Detailed specification
├── models/              # Downloaded models (created by script)
│   ├── sam2/
│   │   ├── configs/sam2.1/     # SAM2.1 configuration files
│   │   │   ├── sam2.1_hiera_t.yaml
│   │   │   ├── sam2.1_hiera_s.yaml
│   │   │   ├── sam2.1_hiera_b+.yaml
│   │   │   └── sam2.1_hiera_l.yaml
│   │   └── checkpoints/        # SAM2.1 model weights
│   │       ├── sam2.1_hiera_tiny.pt
│   │       ├── sam2.1_hiera_small.pt
│   │       ├── sam2.1_hiera_base_plus.pt
│   │       └── sam2.1_hiera_large.pt
│   └── yolo/            # YOLO model weights
│       ├── yolov8n.pt
│       ├── yolov8s.pt
│       └── yolov8m.pt
├── core/                # Core processing modules
│   ├── __init__.py
│   ├── config_loader.py  # Configuration management
│   ├── sam2_processor.py # SAM2 segmentation (planned)
│   ├── video_splitter.py # Video segmentation
│   └── yolo_detector.py  # YOLO human detection
└── utils/               # Utility modules
    ├── __init__.py
    ├── file_utils.py     # File operations
    ├── logging_utils.py  # Logging configuration
    └── status_utils.py   # Progress monitoring

Usage Examples

Basic Processing

python main.py --config config.yaml

Custom Configuration

python main.py --config my_custom_config.yaml --log-file processing.log

Process Specific Segments Only

processing:
  detect_segments: [0, 5, 10, 15]  # Only process these segments

High-Quality Processing

processing:
  inference_scale: 1.0  # Full resolution inference
video:
  output_bitrate: "100M"  # Higher bitrate

Performance Considerations

Hardware Requirements

  • GPU: NVIDIA GPU with 8GB+ VRAM (recommended)
  • RAM: 16GB+ for processing large videos
  • Storage: SSD recommended for temporary files

Processing Time

  • Roughly 1-2x real time on modern GPUs (a 10-minute video takes about 10-20 minutes end to end)
  • Scales with video resolution and number of segments
  • YOLO detection: ~1-2 seconds per segment
  • SAM2 processing: ~10-30 seconds per segment (estimated)

Optimization Tips

  1. Use inference_scale: 0.5 for faster processing
  2. Process only key segments with detect_segments list
  3. Enable NVENC for hardware-accelerated encoding
  4. Use SSD storage for temporary files

Troubleshooting

Common Issues

ImportError: No module named 'sam2'

pip install git+https://github.com/facebookresearch/sam2.git

CUDA out of memory

  • Reduce inference_scale to 0.25 or 0.5
  • Process fewer segments at once
  • Use a smaller YOLO model (yolov8n.pt instead of yolov8x.pt)

FFmpeg not found

# Ubuntu/Debian
sudo apt install ffmpeg

# macOS
brew install ffmpeg

# Windows
# Download from https://ffmpeg.org/download.html

No humans detected

  • Lower yolo_confidence threshold
  • Check that humans are clearly visible in the video
  • Verify the input video format is supported
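
To rule out a model or threshold problem, run YOLO on a single extracted frame with a deliberately low threshold and inspect the raw confidences (a quick sanity check, not part of the pipeline; the frame path is illustrative):

import cv2
from ultralytics import YOLO

model = YOLO("models/yolo/yolov8n.pt")
frame = cv2.imread("test_frame.jpg")

for result in model(frame, conf=0.1):
    for cls, conf in zip(result.boxes.cls.tolist(), result.boxes.conf.tolist()):
        print(model.names[int(cls)], round(conf, 3))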

Debug Mode

Enable detailed logging:

advanced:
  log_level: "DEBUG"

Current Status

Implemented:

  • ✅ Video segmentation with FFmpeg
  • ✅ YOLO human detection
  • ✅ Configuration management
  • ✅ Progress monitoring
  • ✅ Segment cleanup utilities

In Development:

  • 🚧 SAM2 integration and mask generation
  • 🚧 Green screen processing
  • 🚧 Video assembly with audio

Planned:

  • 📋 Multi-object tracking
  • 📋 Real-time processing support
  • 📋 Web interface
  • 📋 Cloud processing integration

Contributing

This project is under active development. The core detection pipeline is functional, with SAM2 integration and green screen processing coming soon.

License

[Add your license information here]

Support

For issues and questions:

  1. Check the troubleshooting section
  2. Review the logs with log_level: "DEBUG"
  3. Open an issue with your configuration and error details