YOLO + SAM2 Video Processing Pipeline
Overview
This project provides an automated video processing pipeline that uses YOLO for human detection and SAM2 for precise segmentation to create green screen videos. The system processes long videos by splitting them into manageable segments, detecting and tracking humans in each segment, and then reassembling the processed segments into a final output video with preserved audio.
Core Functionality
Input
- Long video file (MP4 format, any duration)
- Configuration file (YAML format) specifying processing parameters
Output
- Processed video file with humans visible and background replaced with green screen
- Preserved audio from the original input video
- Intermediate files for debugging and quality control
Processing Pipeline
1. Video Segmentation
- Splits the input video into segments of configurable duration (default: 5 seconds)
- Creates an organized directory structure: segment_0/, segment_1/, etc., where each segment folder contains that segment's video file
- Forces keyframes at segment boundaries so each cut starts cleanly and encodes consistently (see the sketch below)
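As a rough illustration, the splitting step can be expressed as a single ffmpeg call driven from Python. This is a minimal sketch, not the project's actual video_splitter.py: split_video is a hypothetical name, and it writes flat segment files rather than the per-segment folders described above.

```python
# Minimal sketch of the segmentation step; assumes ffmpeg is on PATH.
# The real pipeline places each segment in its own segment_N/ folder.
import subprocess
from pathlib import Path

def split_video(input_path: str, out_dir: str, segment_duration: int = 5) -> None:
    """Split the input into fixed-length segments with a keyframe forced at each cut."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    subprocess.run([
        "ffmpeg", "-i", input_path,
        # Force a keyframe at every segment boundary so segments start cleanly.
        "-force_key_frames", f"expr:gte(t,n_forced*{segment_duration})",
        "-f", "segment",
        "-segment_time", str(segment_duration),
        "-reset_timestamps", "1",
        str(out / "segment_%d.mp4"),
    ], check=True)
```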
2. Human Detection & Tracking
- YOLO Detection: Automatically detects humans in keyframe segments using YOLOv8
- SAM2 Segmentation: Uses detected bounding boxes as prompts for precise mask generation
- Mask Propagation: Propagates masks across all frames in each segment
- Stereo Video Support: Handles VR/stereo content with left/right human assignment
- Continuity: non-keyframe segments reuse masks from the previous segment for temporal consistency (detection and propagation are sketched below)
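The detection-to-propagation handoff looks roughly like the following. This is an illustrative sketch, not the shipped yolo_detector.py/sam2_processor.py: segment_humans and frames_dir are hypothetical names, the model paths simply mirror the configuration example later in this document, and the ultralytics/sam2 calls shown should be checked against the installed versions.

```python
# Sketch: YOLO person boxes prompt SAM2, which propagates masks across the segment.
import numpy as np
from ultralytics import YOLO
from sam2.build_sam import build_sam2_video_predictor

yolo = YOLO("yolov8n.pt")
predictor = build_sam2_video_predictor(
    "configs/sam2.1/sam2.1_hiera_l.yaml", "../checkpoints/sam2.1_hiera_large.pt"
)

def segment_humans(frames_dir: str, first_frame: np.ndarray) -> dict:
    """frames_dir holds the segment's extracted frames; returns per-frame boolean masks."""
    # Detect persons (COCO class 0) on the segment's first frame.
    boxes = yolo(first_frame, classes=[0], conf=0.6)[0].boxes.xyxy.cpu().numpy()
    state = predictor.init_state(video_path=frames_dir)
    # Each detected box becomes a SAM2 object prompt on frame 0.
    for obj_id, box in enumerate(boxes, start=1):
        predictor.add_new_points_or_box(state, frame_idx=0, obj_id=obj_id, box=box)
    # Propagate the box-prompted masks across every frame in the segment.
    masks = {}
    for frame_idx, obj_ids, mask_logits in predictor.propagate_in_video(state):
        masks[frame_idx] = (mask_logits > 0.0).cpu().numpy()
    return masks
```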
3. Green Screen Processing
- Mask Application: Applies generated masks to isolate humans
- Background Replacement: Replaces non-human areas with green screen (RGB: 0,255,0)
- GPU Acceleration: Uses CuPy for fast mask processing
- Multi-resolution: runs SAM2 inference at reduced resolution for speed, then renders the final output at full resolution (a compositing sketch follows)
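Compositing itself is a single masked select on the GPU. A minimal sketch, assuming CuPy is installed and the mask has already been upscaled to output resolution (apply_green_screen is a hypothetical name, not necessarily the function in mask_processor.py):

```python
# Sketch of GPU green-screen compositing with CuPy.
import cupy as cp

GREEN = cp.array([0, 255, 0], dtype=cp.uint8)  # green screen color (RGB: 0,255,0)

def apply_green_screen(frame_rgb, mask):
    """Keep human pixels, paint everything else green.

    frame_rgb: (H, W, 3) uint8 array; mask: (H, W) bool array, True where human.
    """
    frame = cp.asarray(frame_rgb)
    keep = cp.asarray(mask)[..., None]          # broadcast mask over the color axis
    out = cp.where(keep, frame, GREEN)          # human pixels pass through, rest go green
    return cp.asnumpy(out)
```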
4. Video Assembly
- Segment Concatenation: Combines all processed segments into single video
- Audio Preservation: Copies original audio track to final output
- Quality Maintenance: matches the source resolution and framerate; output quality is governed by output_bitrate (an assembly sketch follows)
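Concatenation and audio muxing can both be delegated to ffmpeg's concat demuxer. A hedged sketch (assemble is a hypothetical name; concat_list is a text file in ffmpeg's concat format, e.g. lines of the form file 'segment_0/out.mp4'):

```python
# Sketch of final assembly: concat processed segments, copy audio from the original.
import subprocess

def assemble(concat_list: str, original_video: str, output_path: str) -> None:
    subprocess.run([
        "ffmpeg",
        "-f", "concat", "-safe", "0", "-i", concat_list,  # processed green-screen segments
        "-i", original_video,                             # original file, used only for audio
        "-map", "0:v:0",                                  # video from the concatenated segments
        "-map", "1:a?",                                   # audio from the original, if present
        "-c", "copy",                                     # stream copy: no re-encode, no quality loss
        output_path,
    ], check=True)
```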
Key Features
Automated Processing
- No Manual Intervention: Fully automated human detection eliminates manual point selection
- Batch Processing: Processes multiple segments efficiently
- Smart Fallback: if detection fails in a segment, masks from the previous segment are loaded so propagation can continue
Modular Architecture
- Configuration-Driven: YAML-based configuration for easy parameter adjustment
- Extensible Design: Modular structure allows for easy feature additions
- Error Recovery: Graceful handling of detection failures and missing segments
Performance Optimizations
- GPU Acceleration: CUDA/NVENC support for faster processing
- Memory Management: Efficient handling of large videos through segmentation
- Concurrent Processing: thread-safe, per-segment parallelism where stages are independent (see the sketch below)
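Where a stage is independent across segments, it can be fanned out over a small thread pool. A minimal sketch under that assumption (process_all_segments and process_one are hypothetical names); GPU-bound stages typically keep the worker count low:

```python
# Hypothetical sketch of concurrent per-segment processing.
from concurrent.futures import ThreadPoolExecutor

def process_all_segments(segment_dirs, process_one, max_workers=2):
    """Apply process_one to every segment directory; few workers avoids GPU contention."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(process_one, segment_dirs))
```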
Technical Stack
Core Dependencies
- SAM2: Facebook's Segment Anything Model 2 for precise segmentation
- YOLOv8 (Ultralytics): Human detection and bounding box generation
- OpenCV: Video processing and frame manipulation
- CuPy: GPU-accelerated array operations
- FFmpeg: Video encoding/decoding and audio handling
- PyTorch: Deep learning framework backend
Supported Formats
- Input Video: MP4, AVI, MOV (any OpenCV-supported format)
- Output Video: MP4 with H.265/HEVC encoding
- Audio: Preserves original audio codec and quality
Configuration Options
Video Processing
- segment_duration: Duration of each video segment (seconds)
- inference_scale: Scale factor for SAM2 inference (for speed)
- output_scale: Scale factor for final output
Detection Parameters
- yolo_model: Path to YOLO model weights
- yolo_confidence: Detection confidence threshold
- detect_segments: Which segments to run YOLO detection on
SAM2 Parameters
- sam2_checkpoint: Path to SAM2 model weights
- sam2_config: SAM2 model configuration file
Output Options
- use_nvenc: Enable NVIDIA hardware encoding
- output_bitrate: Video bitrate for final output
- preserve_audio: Whether to copy audio track
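All of these keys arrive through core/config_loader.py. A minimal sketch of what that loader might look like (the structure is assumed; only the section names from the configuration example below are checked):

```python
# Hypothetical sketch of the configuration loader; assumes PyYAML is installed.
import yaml

def load_config(path: str) -> dict:
    """Load the YAML configuration and verify the expected top-level sections exist."""
    with open(path, "r") as f:
        cfg = yaml.safe_load(f)
    for section in ("input", "output", "processing", "models"):
        if section not in cfg:
            raise KeyError(f"missing required config section: {section}")
    return cfg
```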
Directory Structure
new_yolo/
├── spec.md                  # This specification document
├── requirements.txt         # Python dependencies
├── config.yaml              # Default configuration file
├── main.py                  # Entry point script
├── core/
│   ├── __init__.py
│   ├── video_splitter.py    # Video segmentation logic
│   ├── yolo_detector.py     # YOLO human detection
│   ├── sam2_processor.py    # SAM2 segmentation
│   ├── mask_processor.py    # Mask application and green screen
│   ├── video_assembler.py   # Final video assembly
│   └── config_loader.py     # Configuration management
├── utils/
│   ├── __init__.py
│   ├── file_utils.py        # File system operations
│   ├── video_utils.py       # Video processing utilities
│   └── logging_utils.py     # Logging configuration
└── examples/
    ├── basic_config.yaml    # Example configuration
    └── advanced_config.yaml # Advanced configuration options
Usage Examples
Basic Usage
python main.py --config config.yaml
Custom Configuration
python main.py --config examples/advanced_config.yaml
Configuration File Example
input:
  video_path: "/path/to/input/video.mp4"
output:
  directory: "/path/to/output/"
  filename: "processed_video.mp4"
processing:
  segment_duration: 5
  inference_scale: 0.5
  yolo_confidence: 0.6
  detect_segments: "all"  # or [0, 5, 10]
models:
  yolo_model: "yolov8n.pt"
  sam2_checkpoint: "../checkpoints/sam2.1_hiera_large.pt"
  sam2_config: "configs/sam2.1/sam2.1_hiera_l.yaml"
Use Cases
Content Creation
- VR/360 Video Processing: Remove backgrounds from immersive content
- Green Screen Production: Automated background removal for video production
- Social Media Content: Quick background replacement for content creators
Commercial Applications
- Video Conferencing: background replacement for recorded calls (live streams are a planned enhancement)
- E-learning: Professional video production with clean backgrounds
- Marketing: Product demonstration videos with custom backgrounds
Performance Considerations
Hardware Requirements
- GPU: NVIDIA GPU with CUDA support (recommended)
- RAM: 16GB+ for processing large videos
- Storage: SSD recommended for temporary file operations
Processing Time
- Processing takes roughly 1-2x the video's duration on a modern GPU
- Processing time scales with video resolution and segment count
- Peak memory stays roughly constant regardless of input length, since only one segment is processed at a time
Future Enhancements
Planned Features
- Multi-object Tracking: Support for multiple humans per frame
- Custom Object Detection: Configurable object classes beyond humans
- Real-time Processing: Live video stream support
- Cloud Integration: AWS/GCP processing support
- Web Interface: Browser-based configuration and monitoring
Model Improvements
- Fine-tuned YOLO: Domain-specific human detection models
- SAM2 Optimization: Custom SAM2 checkpoints for video content
- Temporal Consistency: Enhanced cross-segment mask propagation