YOLO + SAM2 Video Processing Pipeline

An automated video processing system that combines YOLO object detection with Meta's SAM2 (Segment Anything Model 2) to create green screen videos with precise human segmentation.

Overview

This pipeline processes long videos by splitting them into manageable segments, detecting humans using YOLO, and generating precise masks with SAM2 for green screen background replacement. The system preserves audio and maintains video quality throughout the process.

Features

  • Automated Human Detection: Uses YOLOv8 for robust human detection
  • Precise Segmentation: Leverages SAM2 for accurate mask generation
  • Scalable Processing: Handles videos of any length through segmentation
  • GPU Acceleration: CUDA/NVENC support for faster processing
  • Audio Preservation: Maintains original audio track in output
  • Stereo Video Support: Handles VR/360 content with left/right tracking
  • Configurable Pipeline: YAML-based configuration for easy customization

Installation

Prerequisites

  • Python 3.8+
  • NVIDIA GPU with CUDA support (recommended)
  • FFmpeg installed and available in PATH

Install Dependencies

# Clone the repository
git clone <repository-url>
cd samyolo_on_segments

# Install Python dependencies
uv venv && source .venv/bin/activate
uv pip install -r requirements.txt

Download Models

Use the provided script to automatically download all required models:

# Download SAM2.1 and YOLO models
python download_models.py

This script will:

  • Create a models/ directory structure
  • Download SAM2.1 configs and checkpoints (tiny, small, base+, large)
  • Download common YOLO models (yolov8n, yolov8s, yolov8m)
  • Update config.yaml to use local model paths

Manual Download (Alternative):

  1. SAM2 Models: Download from Meta's SAM2 repository
  2. YOLO Models: YOLOv8 models will be downloaded automatically on first use

Quick Start

1. Download Models

First, download the required SAM2.1 and YOLO models:

python download_models.py

2. Configure the Pipeline

Edit config.yaml to specify your input video and desired settings:

input:
  video_path: "/path/to/your/video.mp4"
  
output:
  directory: "/path/to/output/"
  filename: "processed_video.mp4"
  
processing:
  segment_duration: 5
  inference_scale: 0.5
  yolo_confidence: 0.6
  detect_segments: "all"
  
models:
  yolo_model: "models/yolo/yolov8n.pt"
  sam2_checkpoint: "models/sam2/checkpoints/sam2.1_hiera_large.pt"
  sam2_config: "models/sam2/configs/sam2.1/sam2.1_hiera_l.yaml"

3. Run the Pipeline

python main.py --config config.yaml

4. Monitor Progress

Check processing status:

python main.py --config config.yaml --status

Clean up a specific segment for reprocessing:

python main.py --config config.yaml --cleanup-segment 5

Configuration Options

Input/Output Settings

Parameter         Description               Default
----------------  ------------------------  --------
input.video_path  Path to input video file  Required
output.directory  Output directory path     Required
output.filename   Output video filename     Required

Processing Parameters

Parameter                    Description                          Default
---------------------------  -----------------------------------  -------
processing.segment_duration  Duration of each segment (seconds)   5
processing.inference_scale   Scale factor for SAM2 inference      0.5
processing.yolo_confidence   YOLO detection confidence threshold  0.6
processing.detect_segments   Segments to process ("all" or list)  "all"
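
The inference_scale parameter trades mask fidelity for speed by downscaling frames before SAM2 inference and upscaling the resulting masks back to full resolution, roughly as in this sketch (run_sam2 is a hypothetical callable standing in for the segmentation step):

import cv2

def scaled_inference(frame, run_sam2, scale=0.5):
    """Run mask inference at reduced resolution, return a full-size mask."""
    h, w = frame.shape[:2]
    small = cv2.resize(frame, (int(w * scale), int(h * scale)))
    mask = run_sam2(small)  # hypothetical: returns a uint8 mask for the small frame
    return cv2.resize(mask, (w, h), interpolation=cv2.INTER_NEAREST)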

Model Configuration

Parameter               Description              Default
----------------------  -----------------------  ------------
models.yolo_model       YOLO model path or name  "yolov8n.pt"
models.sam2_checkpoint  SAM2 checkpoint path     Required
models.sam2_config      SAM2 config file path    Required

Video Settings

Parameter              Description                     Default
---------------------  ------------------------------  -------
video.use_nvenc        Use NVIDIA hardware encoding    true
video.output_bitrate   Output video bitrate            "50M"
video.preserve_audio   Copy original audio track       true
video.force_keyframes  Force keyframes for clean cuts  true

Advanced Options

Parameter                            Description                     Default
-----------------------------------  ------------------------------  -----------
advanced.green_color                 Green screen color (BGR order)  [0, 255, 0]
advanced.blue_color                  Blue screen color (BGR order)   [255, 0, 0]
advanced.human_class_id              YOLO class ID for "person"      0
advanced.log_level                   Logging verbosity               "INFO"
advanced.cleanup_intermediate_files  Clean up temporary files        true

Processing Pipeline

Step 1: Video Segmentation

  • Splits input video into configurable segments (default 5 seconds)
  • Creates organized directory structure: video_segments/segment_0/, segment_1/, etc.
  • Uses FFmpeg with keyframe forcing for clean cuts
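A minimal sketch of the FFmpeg invocation behind this step (the real logic lives in core/video_splitter.py; the output naming here is illustrative):

import subprocess

def split_video(video_path, out_dir, segment_duration=5):
    """Split a video into fixed-length segments with keyframes forced at cut points."""
    subprocess.run([
        "ffmpeg", "-i", video_path,
        # Place a keyframe exactly at every segment boundary so cuts are clean
        "-force_key_frames", f"expr:gte(t,n_forced*{segment_duration})",
        "-f", "segment",
        "-segment_time", str(segment_duration),
        "-reset_timestamps", "1",
        f"{out_dir}/segment_%04d.mp4",
    ], check=True)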

Step 2: Human Detection

  • Runs YOLO detection on specified segments
  • Detects human bounding boxes with configurable confidence threshold
  • Saves detection results for reuse and debugging
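Conceptually, the detection step looks like the sketch below (using the ultralytics API; the real implementation lives in core/yolo_detector.py, and the segment path is illustrative):

from ultralytics import YOLO

model = YOLO("models/yolo/yolov8n.pt")

# Class 0 is "person" in the COCO label set YOLOv8 ships with
results = model("video_segments/segment_0/segment_0.mp4",
                conf=0.6, classes=[0], stream=True)

for frame_idx, result in enumerate(results):
    boxes = result.boxes.xyxy.cpu().numpy()  # (N, 4) array of [x1, y1, x2, y2]
    print(f"frame {frame_idx}: {len(boxes)} person(s)")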

Step 3: SAM2 Segmentation (In Development)

  • Uses YOLO detections as prompts for SAM2
  • Generates precise masks for detected humans
  • Propagates masks across all frames in segments
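Since this step is still in development, the following is only a rough sketch of how the public sam2 video predictor API can be driven with a YOLO box as the prompt (paths follow config.yaml; the frames directory and example box are illustrative):

import torch
from sam2.build_sam import build_sam2_video_predictor

predictor = build_sam2_video_predictor(
    "models/sam2/configs/sam2.1/sam2.1_hiera_l.yaml",
    "models/sam2/checkpoints/sam2.1_hiera_large.pt",
)

yolo_box = [400, 100, 880, 980]  # example [x1, y1, x2, y2] from Step 2

with torch.inference_mode():
    # init_state expects a directory of extracted JPEG frames
    state = predictor.init_state(video_path="video_segments/segment_0/frames")

    # Seed SAM2 with the YOLO bounding box on the first frame
    predictor.add_new_points_or_box(
        inference_state=state, frame_idx=0, obj_id=1, box=yolo_box,
    )

    # Propagate the mask through the remaining frames of the segment
    for frame_idx, obj_ids, mask_logits in predictor.propagate_in_video(state):
        masks = (mask_logits > 0.0).cpu().numpy()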

Step 4: Green Screen Processing (In Development)

  • Applies generated masks to isolate humans
  • Replaces background with green screen
  • Maintains video quality and framerate
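The compositing itself is straightforward once a per-frame mask exists; a minimal NumPy sketch, assuming frames are handled in OpenCV's BGR order:

import numpy as np

GREEN_BGR = (0, 255, 0)  # matches advanced.green_color

def apply_green_screen(frame, mask):
    """frame: HxWx3 BGR uint8 image; mask: HxW boolean array from SAM2."""
    out = np.full_like(frame, GREEN_BGR)  # start from a solid green frame
    out[mask] = frame[mask]               # copy the human pixels back in
    return out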

Step 5: Video Assembly (In Development)

  • Concatenates processed segments
  • Preserves original audio track
  • Outputs final video with green screen background
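The planned assembly can be done with FFmpeg's concat demuxer, pulling the audio back from the original file (a sketch; file names are illustrative):

import subprocess

def assemble(segment_list_txt, original_video, output_path):
    """Concatenate processed segments and mux in the original audio track."""
    subprocess.run([
        "ffmpeg",
        # segment_list_txt contains one "file 'segment_N.mp4'" line per segment
        "-f", "concat", "-safe", "0", "-i", segment_list_txt,
        "-i", original_video,
        "-map", "0:v", "-map", "1:a",  # video from segments, audio from the original
        "-c", "copy",
        output_path,
    ], check=True)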

Project Structure

samyolo_on_segments/
├── README.md              # This documentation
├── config.yaml           # Default configuration
├── main.py               # Main entry point
├── download_models.py    # Model download script
├── requirements.txt      # Python dependencies
├── spec.md              # Detailed specification
├── models/              # Downloaded models (created by script)
│   ├── sam2/
│   │   ├── configs/sam2.1/     # SAM2.1 configuration files
│   │   │   ├── sam2.1_hiera_t.yaml
│   │   │   ├── sam2.1_hiera_s.yaml
│   │   │   ├── sam2.1_hiera_b+.yaml
│   │   │   └── sam2.1_hiera_l.yaml
│   │   └── checkpoints/        # SAM2.1 model weights
│   │       ├── sam2.1_hiera_tiny.pt
│   │       ├── sam2.1_hiera_small.pt
│   │       ├── sam2.1_hiera_base_plus.pt
│   │       └── sam2.1_hiera_large.pt
│   └── yolo/            # YOLO model weights
│       ├── yolov8n.pt
│       ├── yolov8s.pt
│       └── yolov8m.pt
├── core/                # Core processing modules
│   ├── __init__.py
│   ├── config_loader.py  # Configuration management
│   ├── sam2_processor.py # SAM2 segmentation (planned)
│   ├── video_splitter.py # Video segmentation
│   └── yolo_detector.py  # YOLO human detection
└── utils/               # Utility modules
    ├── __init__.py
    ├── file_utils.py     # File operations
    ├── logging_utils.py  # Logging configuration
    └── status_utils.py   # Progress monitoring

Usage Examples

Basic Processing

python main.py --config config.yaml

Custom Configuration

python main.py --config my_custom_config.yaml --log-file processing.log

Process Specific Segments Only

processing:
  detect_segments: [0, 5, 10, 15]  # Only process these segments

High-Quality Processing

processing:
  inference_scale: 1.0  # Full resolution inference
video:
  output_bitrate: "100M"  # Higher bitrate

Performance Considerations

Hardware Requirements

  • GPU: NVIDIA GPU with 8GB+ VRAM (recommended)
  • RAM: 16GB+ for processing large videos
  • Storage: SSD recommended for temporary files

Processing Time

  • Roughly 1-2x real time on modern GPUs (a 10-minute video takes about 10-20 minutes end to end)
  • Scales with video resolution and number of segments
  • YOLO detection: ~1-2 seconds per segment
  • SAM2 processing: ~10-30 seconds per segment (estimated)

Optimization Tips

  1. Use inference_scale: 0.5 for faster processing
  2. Process only key segments with detect_segments list
  3. Enable NVENC for hardware-accelerated encoding
  4. Use SSD storage for temporary files

Troubleshooting

Common Issues

ImportError: No module named 'sam2'

pip install git+https://github.com/facebookresearch/sam2.git

CUDA out of memory

  • Reduce inference_scale to 0.25 or 0.5
  • Process fewer segments at once
  • Use a smaller YOLO model (yolov8n.pt instead of yolov8x.pt)

FFmpeg not found

# Ubuntu/Debian
sudo apt install ffmpeg

# macOS
brew install ffmpeg

# Windows
# Download from https://ffmpeg.org/download.html

No humans detected

  • Lower yolo_confidence threshold
  • Check that humans are clearly visible in the video
  • Verify the input video format is supported
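
To rule out a model or threshold problem, run YOLO on a single extracted frame with a deliberately low threshold and inspect the raw confidences (a quick sanity check, not part of the pipeline; the frame path is illustrative):

import cv2
from ultralytics import YOLO

model = YOLO("models/yolo/yolov8n.pt")
frame = cv2.imread("test_frame.jpg")

for result in model(frame, conf=0.1):
    for cls, conf in zip(result.boxes.cls.tolist(), result.boxes.conf.tolist()):
        print(model.names[int(cls)], round(conf, 3))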

Debug Mode

Enable detailed logging:

advanced:
  log_level: "DEBUG"

Current Status

Implemented:

  • ✅ Video segmentation with FFmpeg
  • ✅ YOLO human detection
  • ✅ Configuration management
  • ✅ Progress monitoring
  • ✅ Segment cleanup utilities

In Development:

  • 🚧 SAM2 integration and mask generation
  • 🚧 Green screen processing
  • 🚧 Video assembly with audio

Planned:

  • 📋 Multi-object tracking
  • 📋 Real-time processing support
  • 📋 Web interface
  • 📋 Cloud processing integration

Contributing

This project is under active development. The core detection pipeline is functional, with SAM2 integration and green screen processing coming soon.

License

[Add your license information here]

Support

For issues and questions:

  1. Check the troubleshooting section
  2. Review the logs with log_level: "DEBUG"
  3. Open an issue with your configuration and error details