YOLO + SAM2 Video Processing Pipeline
Overview
This project provides an automated video processing pipeline that uses YOLO for human detection and SAM2 for precise segmentation to create green screen videos. The system processes long videos by splitting them into manageable segments, detecting and tracking humans in each segment, and then reassembling the processed segments into a final output video with preserved audio.
Core Functionality
Input
- Long video file (MP4 format, any duration)
- Configuration file (YAML format) specifying processing parameters
Output
- Processed video file with humans visible and background replaced with green screen
- Preserved audio from the original input video
- Intermediate files for debugging and quality control
Processing Pipeline
1. Video Segmentation
- Splits the input video into segments of configurable duration (default: 5 seconds)
- Creates an organized directory structure: segment_0/, segment_1/, etc., where each segment folder contains that segment's video file
- Forces keyframes at segment boundaries so each cut starts cleanly and encodes consistently (see the sketch below)
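As a rough illustration, the splitting step can be expressed as a single ffmpeg call driven from Python. This is a minimal sketch, not the project's actual video_splitter.py: split_video is a hypothetical name, and it writes flat segment files rather than the per-segment folders described above.

```python
# Minimal sketch of the segmentation step; assumes ffmpeg is on PATH.
# The real pipeline places each segment in its own segment_N/ folder.
import subprocess
from pathlib import Path

def split_video(input_path: str, out_dir: str, segment_duration: int = 5) -> None:
    """Split the input into fixed-length segments with a keyframe forced at each cut."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    subprocess.run([
        "ffmpeg", "-i", input_path,
        # Force a keyframe at every segment boundary so segments start cleanly.
        "-force_key_frames", f"expr:gte(t,n_forced*{segment_duration})",
        "-f", "segment",
        "-segment_time", str(segment_duration),
        "-reset_timestamps", "1",
        str(out / "segment_%d.mp4"),
    ], check=True)
```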
2. Human Detection & Tracking
- YOLO Detection: Automatically detects humans in keyframe segments using YOLOv8
- SAM2 Segmentation: Uses detected bounding boxes as prompts for precise mask generation
- Mask Propagation: Propagates masks across all frames in each segment
- Stereo Video Support: Handles VR/stereo content with left/right human assignment
- Continuity: non-keyframe segments reuse masks from the previous segment for temporal consistency (detection and propagation are sketched below)
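The detection-to-propagation handoff looks roughly like the following. This is an illustrative sketch, not the shipped yolo_detector.py/sam2_processor.py: segment_humans and frames_dir are hypothetical names, the model paths simply mirror the configuration example later in this document, and the ultralytics/sam2 calls shown should be checked against the installed versions.

```python
# Sketch: YOLO person boxes prompt SAM2, which propagates masks across the segment.
import numpy as np
from ultralytics import YOLO
from sam2.build_sam import build_sam2_video_predictor

yolo = YOLO("yolov8n.pt")
predictor = build_sam2_video_predictor(
    "configs/sam2.1/sam2.1_hiera_l.yaml", "../checkpoints/sam2.1_hiera_large.pt"
)

def segment_humans(frames_dir: str, first_frame: np.ndarray) -> dict:
    """frames_dir holds the segment's extracted frames; returns per-frame boolean masks."""
    # Detect persons (COCO class 0) on the segment's first frame.
    boxes = yolo(first_frame, classes=[0], conf=0.6)[0].boxes.xyxy.cpu().numpy()
    state = predictor.init_state(video_path=frames_dir)
    # Each detected box becomes a SAM2 object prompt on frame 0.
    for obj_id, box in enumerate(boxes, start=1):
        predictor.add_new_points_or_box(state, frame_idx=0, obj_id=obj_id, box=box)
    # Propagate the box-prompted masks across every frame in the segment.
    masks = {}
    for frame_idx, obj_ids, mask_logits in predictor.propagate_in_video(state):
        masks[frame_idx] = (mask_logits > 0.0).cpu().numpy()
    return masks
```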
3. Green Screen Processing
- Mask Application: Applies generated masks to isolate humans
- Background Replacement: Replaces non-human areas with green screen (RGB: 0,255,0)
- GPU Acceleration: Uses CuPy for fast mask processing
- Multi-resolution: runs SAM2 inference at reduced resolution for speed, then renders the final output at full resolution (a compositing sketch follows)
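Compositing itself is a single masked select on the GPU. A minimal sketch, assuming CuPy is installed and the mask has already been upscaled to output resolution (apply_green_screen is a hypothetical name, not necessarily the function in mask_processor.py):

```python
# Sketch of GPU green-screen compositing with CuPy.
import cupy as cp

GREEN = cp.array([0, 255, 0], dtype=cp.uint8)  # green screen color (RGB: 0,255,0)

def apply_green_screen(frame_rgb, mask):
    """Keep human pixels, paint everything else green.

    frame_rgb: (H, W, 3) uint8 array; mask: (H, W) bool array, True where human.
    """
    frame = cp.asarray(frame_rgb)
    keep = cp.asarray(mask)[..., None]          # broadcast mask over the color axis
    out = cp.where(keep, frame, GREEN)          # human pixels pass through, rest go green
    return cp.asnumpy(out)
```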
4. Video Assembly
- Segment Concatenation: Combines all processed segments into single video
- Audio Preservation: Copies original audio track to final output
- Quality Maintenance: matches the source resolution and framerate; output quality is governed by output_bitrate (an assembly sketch follows)
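Concatenation and audio muxing can both be delegated to ffmpeg's concat demuxer. A hedged sketch (assemble is a hypothetical name; concat_list is a text file in ffmpeg's concat format, e.g. lines of the form file 'segment_0/out.mp4'):

```python
# Sketch of final assembly: concat processed segments, copy audio from the original.
import subprocess

def assemble(concat_list: str, original_video: str, output_path: str) -> None:
    subprocess.run([
        "ffmpeg",
        "-f", "concat", "-safe", "0", "-i", concat_list,  # processed green-screen segments
        "-i", original_video,                             # original file, used only for audio
        "-map", "0:v:0",                                  # video from the concatenated segments
        "-map", "1:a?",                                   # audio from the original, if present
        "-c", "copy",                                     # stream copy: no re-encode, no quality loss
        output_path,
    ], check=True)
```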
Key Features
Automated Processing
- No Manual Intervention: Fully automated human detection eliminates manual point selection
- Batch Processing: Processes multiple segments efficiently
- Smart Fallback: if detection fails in a segment, masks from the previous segment are loaded so propagation can continue
Modular Architecture
- Configuration-Driven: YAML-based configuration for easy parameter adjustment
- Extensible Design: Modular structure allows for easy feature additions
- Error Recovery: Graceful handling of detection failures and missing segments
Performance Optimizations
- GPU Acceleration: CUDA/NVENC support for faster processing
- Memory Management: Efficient handling of large videos through segmentation
- Concurrent Processing: thread-safe, per-segment parallelism where stages are independent (see the sketch below)
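Where a stage is independent across segments, it can be fanned out over a small thread pool. A minimal sketch under that assumption (process_all_segments and process_one are hypothetical names); GPU-bound stages typically keep the worker count low:

```python
# Hypothetical sketch of concurrent per-segment processing.
from concurrent.futures import ThreadPoolExecutor

def process_all_segments(segment_dirs, process_one, max_workers=2):
    """Apply process_one to every segment directory; few workers avoids GPU contention."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(process_one, segment_dirs))
```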
Technical Stack
Core Dependencies
- SAM2: Facebook's Segment Anything Model 2 for precise segmentation
- YOLOv8 (Ultralytics): Human detection and bounding box generation
- OpenCV: Video processing and frame manipulation
- CuPy: GPU-accelerated array operations
- FFmpeg: Video encoding/decoding and audio handling
- PyTorch: Deep learning framework backend
Supported Formats
- Input Video: MP4, AVI, MOV (any OpenCV-supported format)
- Output Video: MP4 with H.265/HEVC encoding
- Audio: Preserves original audio codec and quality
Configuration Options
Video Processing
- segment_duration: Duration of each video segment (seconds)
- inference_scale: Scale factor for SAM2 inference (for speed)
- output_scale: Scale factor for final output
Detection Parameters
- yolo_model: Path to YOLO model weights
- yolo_confidence: Detection confidence threshold
- detect_segments: Which segments to run YOLO detection on
SAM2 Parameters
- sam2_checkpoint: Path to SAM2 model weights
- sam2_config: SAM2 model configuration file
Output Options
- use_nvenc: Enable NVIDIA hardware encoding
- output_bitrate: Video bitrate for final output
- preserve_audio: Whether to copy audio track
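All of these keys arrive through core/config_loader.py. A minimal sketch of what that loader might look like (the structure is assumed; only the section names from the configuration example below are checked):

```python
# Hypothetical sketch of the configuration loader; assumes PyYAML is installed.
import yaml

def load_config(path: str) -> dict:
    """Load the YAML configuration and verify the expected top-level sections exist."""
    with open(path, "r") as f:
        cfg = yaml.safe_load(f)
    for section in ("input", "output", "processing", "models"):
        if section not in cfg:
            raise KeyError(f"missing required config section: {section}")
    return cfg
```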
Directory Structure
new_yolo/
├── spec.md                  # This specification document
├── requirements.txt         # Python dependencies
├── config.yaml              # Default configuration file
├── main.py                  # Entry point script
├── core/
│   ├── __init__.py
│   ├── video_splitter.py    # Video segmentation logic
│   ├── yolo_detector.py     # YOLO human detection
│   ├── sam2_processor.py    # SAM2 segmentation
│   ├── mask_processor.py    # Mask application and green screen
│   ├── video_assembler.py   # Final video assembly
│   └── config_loader.py     # Configuration management
├── utils/
│   ├── __init__.py
│   ├── file_utils.py        # File system operations
│   ├── video_utils.py       # Video processing utilities
│   └── logging_utils.py     # Logging configuration
└── examples/
    ├── basic_config.yaml    # Example configuration
    └── advanced_config.yaml # Advanced configuration options
Usage Examples
Basic Usage
python main.py --config config.yaml
Custom Configuration
python main.py --config examples/advanced_config.yaml
Configuration File Example
input:
  video_path: "/path/to/input/video.mp4"
output:
  directory: "/path/to/output/"
  filename: "processed_video.mp4"
processing:
  segment_duration: 5
  inference_scale: 0.5
  yolo_confidence: 0.6
  detect_segments: "all"  # or [0, 5, 10]
models:
  yolo_model: "yolov8n.pt"
  sam2_checkpoint: "../checkpoints/sam2.1_hiera_large.pt"
  sam2_config: "configs/sam2.1/sam2.1_hiera_l.yaml"
Use Cases
Content Creation
- VR/360 Video Processing: Remove backgrounds from immersive content
- Green Screen Production: Automated background removal for video production
- Social Media Content: Quick background replacement for content creators
Commercial Applications
- Video Conferencing: background replacement for recorded calls (live streams are a planned enhancement)
- E-learning: Professional video production with clean backgrounds
- Marketing: Product demonstration videos with custom backgrounds
Performance Considerations
Hardware Requirements
- GPU: NVIDIA GPU with CUDA support (recommended)
- RAM: 16GB+ for processing large videos
- Storage: SSD recommended for temporary file operations
Processing Time
- Processing takes roughly 1-2x the video's duration on a modern GPU
- Processing time scales with video resolution and segment count
- Peak memory stays roughly constant regardless of input length, since only one segment is processed at a time
Future Enhancements
Planned Features
- Multi-object Tracking: Support for multiple humans per frame
- Custom Object Detection: Configurable object classes beyond humans
- Real-time Processing: Live video stream support
- Cloud Integration: AWS/GCP processing support
- Web Interface: Browser-based configuration and monitoring
Model Improvements
- Fine-tuned YOLO: Domain-specific human detection models
- SAM2 Optimization: Custom SAM2 checkpoints for video content
- Temporal Consistency: Enhanced cross-segment mask propagation