# YOLO + SAM2 Video Processing Pipeline
An automated video processing system that combines YOLO object detection with Meta's SAM2 (Segment Anything Model 2) to create green screen videos with precise human segmentation.
## Overview
This pipeline processes long videos by splitting them into manageable segments, detecting humans using YOLO, and generating precise masks with SAM2 for green screen background replacement. The system preserves audio and maintains video quality throughout the process.
## Features

- **Automated Human Detection**: Uses YOLOv8 for robust human detection
- **Precise Segmentation**: Leverages SAM2 for accurate mask generation
- **Scalable Processing**: Handles videos of any length by splitting them into segments
- **GPU Acceleration**: CUDA/NVENC support for faster processing
- **Audio Preservation**: Maintains the original audio track in the output
- **Stereo Video Support**: Handles VR/360 content with left/right tracking
- **Configurable Pipeline**: YAML-based configuration for easy customization
## Installation

### Prerequisites

- Python 3.8+
- NVIDIA GPU with CUDA support (recommended)
- FFmpeg installed and available in PATH

### Install Dependencies

```bash
# Clone the repository
git clone <repository-url>
cd samyolo_on_segments

# Install Python dependencies
pip install -r requirements.txt
```
### Model Dependencies

You'll need to download the required model checkpoints:

- **SAM2 models**: Download from Meta's SAM2 repository
- **YOLO models**: YOLOv8 models are downloaded automatically, or you can specify a custom path
## Quick Start

### 1. Configure the Pipeline

Edit `config.yaml` to specify your input video and desired settings:

```yaml
input:
  video_path: "/path/to/your/video.mp4"

output:
  directory: "/path/to/output/"
  filename: "processed_video.mp4"

processing:
  segment_duration: 5
  inference_scale: 0.5
  yolo_confidence: 0.6
  detect_segments: "all"

models:
  yolo_model: "yolov8n.pt"
  sam2_checkpoint: "../checkpoints/sam2.1_hiera_large.pt"
  sam2_config: "configs/sam2.1/sam2.1_hiera_l.yaml"
```
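The repository's `core/config_loader.py` handles configuration management. As a rough sketch of the kind of validation such a loader might perform (the function and the exact set of required keys below are illustrative assumptions, not the repository's actual implementation):

```python
# Hypothetical validation sketch; key names follow config.yaml,
# but this is not the repo's actual loader.
REQUIRED_KEYS = [
    ("input", "video_path"),
    ("output", "directory"),
    ("output", "filename"),
    ("models", "sam2_checkpoint"),
    ("models", "sam2_config"),
]

def validate_config(cfg: dict) -> list:
    """Return the list of missing required keys (empty list means valid)."""
    missing = []
    for section, key in REQUIRED_KEYS:
        if key not in cfg.get(section, {}):
            missing.append(f"{section}.{key}")
    return missing
```

Failing fast on missing keys like this avoids discovering a bad output path only after minutes of GPU work.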
### 2. Run the Pipeline

```bash
python main.py --config config.yaml
```

### 3. Monitor Progress

Check processing status:

```bash
python main.py --config config.yaml --status
```

Clean up a specific segment for reprocessing:

```bash
python main.py --config config.yaml --cleanup-segment 5
```
## Configuration Options

### Input/Output Settings

| Parameter | Description | Default |
|---|---|---|
| `input.video_path` | Path to input video file | Required |
| `output.directory` | Output directory path | Required |
| `output.filename` | Output video filename | Required |
### Processing Parameters

| Parameter | Description | Default |
|---|---|---|
| `processing.segment_duration` | Duration of each segment (seconds) | `5` |
| `processing.inference_scale` | Scale factor for SAM2 inference | `0.5` |
| `processing.yolo_confidence` | YOLO detection confidence threshold | `0.6` |
| `processing.detect_segments` | Segments to process (`"all"` or a list of indices) | `"all"` |
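To make `inference_scale` concrete: a value of 0.5 halves each frame dimension before SAM2 inference, quartering the pixel count. A minimal sketch of the arithmetic (the rounding down to even dimensions is an assumption, since many codecs require even sizes):

```python
def scaled_size(width: int, height: int, scale: float) -> tuple:
    """Compute the inference resolution, rounded down to even dimensions."""
    w = int(width * scale) // 2 * 2
    h = int(height * scale) // 2 * 2
    return w, h

# A 1080p frame at the default scale of 0.5 is processed at 960x540,
# one quarter of the original pixel count.
```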
### Model Configuration

| Parameter | Description | Default |
|---|---|---|
| `models.yolo_model` | YOLO model path or name | `"yolov8n.pt"` |
| `models.sam2_checkpoint` | SAM2 checkpoint path | Required |
| `models.sam2_config` | SAM2 config file path | Required |
### Video Settings

| Parameter | Description | Default |
|---|---|---|
| `video.use_nvenc` | Use NVIDIA hardware encoding | `true` |
| `video.output_bitrate` | Output video bitrate | `"50M"` |
| `video.preserve_audio` | Copy the original audio track | `true` |
| `video.force_keyframes` | Force keyframes for clean cuts | `true` |
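The `use_nvenc` and `output_bitrate` settings map naturally onto FFmpeg encoder flags. A hedged sketch of how the pipeline might assemble them (the helper function is hypothetical; `h264_nvenc`, `libx264`, `-c:v`, and `-b:v` are standard FFmpeg options):

```python
def build_encode_args(use_nvenc: bool = True, output_bitrate: str = "50M") -> list:
    """Return FFmpeg video-encoding arguments for the configured settings."""
    # Hardware H.264 via NVENC when available, otherwise software x264.
    codec = "h264_nvenc" if use_nvenc else "libx264"
    return ["-c:v", codec, "-b:v", output_bitrate]
```

These arguments would be spliced into the full FFmpeg command line between the input and output paths.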
### Advanced Options

| Parameter | Description | Default |
|---|---|---|
| `advanced.green_color` | Green screen color | `[0, 255, 0]` |
| `advanced.blue_color` | Blue screen color (note: `[255, 0, 0]` is blue in OpenCV's BGR channel order) | `[255, 0, 0]` |
| `advanced.human_class_id` | YOLO class ID for humans | `0` |
| `advanced.log_level` | Logging verbosity | `"INFO"` |
| `advanced.cleanup_intermediate_files` | Clean up temporary files | `true` |
## Processing Pipeline

### Step 1: Video Segmentation

- Splits the input video into configurable segments (default: 5 seconds)
- Creates an organized directory structure: `video_segments/segment_0/`, `segment_1/`, etc.
- Uses FFmpeg with keyframe forcing for clean cuts
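The splitting step can be sketched as an FFmpeg invocation using the segment muxer. The helper below is illustrative, not the repository's actual code; the flags (`-f segment`, `-segment_time`, `-force_key_frames`, `-reset_timestamps`) are real FFmpeg options, but the actual command may differ, e.g. in codec choice:

```python
def build_split_cmd(video_path: str, out_pattern: str,
                    segment_duration: int = 5) -> list:
    """Assemble an FFmpeg command that splits a video into fixed-length segments."""
    return [
        "ffmpeg", "-i", video_path,
        # Force a keyframe at every segment boundary so cuts are clean.
        "-force_key_frames", f"expr:gte(t,n_forced*{segment_duration})",
        "-f", "segment",
        "-segment_time", str(segment_duration),
        "-reset_timestamps", "1",
        out_pattern,  # e.g. "video_segments/segment_%03d.mp4"
    ]
```

The command list form (rather than a shell string) is the safer way to pass this to `subprocess.run`, since paths with spaces need no quoting.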
### Step 2: Human Detection
- Runs YOLO detection on specified segments
- Detects human bounding boxes with configurable confidence threshold
- Saves detection results for reuse and debugging
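Whatever YOLO library backs this step, the filtering it applies can be sketched in plain Python. The detection dicts below are a hypothetical representation (Ultralytics' actual result objects differ); the thresholding logic is what matters:

```python
def filter_human_detections(detections, confidence: float = 0.6,
                            human_class_id: int = 0):
    """Keep only detections that are humans above the confidence threshold.

    Each detection is assumed to be a dict with "class_id", "confidence",
    and "box" keys (an illustrative format, not the YOLO library's own).
    """
    return [
        d for d in detections
        if d["class_id"] == human_class_id and d["confidence"] >= confidence
    ]
```

In COCO-trained YOLO models, class 0 is "person", which is why `human_class_id` defaults to 0.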
### Step 3: SAM2 Segmentation (In Development)
- Uses YOLO detections as prompts for SAM2
- Generates precise masks for detected humans
- Propagates masks across all frames in segments
### Step 4: Green Screen Processing (In Development)
- Applies generated masks to isolate humans
- Replaces background with green screen
- Maintains video quality and framerate
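The core compositing operation is simple: wherever the mask marks a human, keep the original pixel; elsewhere, write the green screen color. A minimal NumPy sketch, assuming an 8-bit H×W×3 frame and an H×W binary mask (this is the general technique, not the repository's exact code):

```python
import numpy as np

def apply_green_screen(frame: np.ndarray, mask: np.ndarray,
                       green=(0, 255, 0)) -> np.ndarray:
    """Replace the background of an HxWx3 frame with a solid color.

    `mask` is an HxW array that is nonzero where a human was segmented.
    """
    background = np.empty_like(frame)
    background[:] = green
    keep = mask.astype(bool)[..., None]  # broadcast over the color channels
    return np.where(keep, frame, background)
```

A production version would typically also feather the mask edges (e.g. with a small blur) to avoid hard fringes around the subject.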
### Step 5: Video Assembly (In Development)
- Concatenates processed segments
- Preserves original audio track
- Outputs final video with green screen background
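FFmpeg's concat demuxer is the usual tool for this kind of lossless reassembly. A hedged sketch (the helpers are illustrative, not the repository's code; `-f concat`, `-safe 0`, `-map`, and `-c copy` are standard FFmpeg usage):

```python
def build_concat_list(segment_files) -> str:
    """Render the file-list text expected by FFmpeg's concat demuxer."""
    return "".join(f"file '{path}'\n" for path in segment_files)

def build_concat_cmd(list_path: str, original_video: str,
                     output_path: str) -> list:
    """Concatenate processed segments, taking audio from the original video."""
    return [
        "ffmpeg", "-f", "concat", "-safe", "0", "-i", list_path,
        "-i", original_video,
        "-map", "0:v",   # video from the concatenated segments
        "-map", "1:a?",  # audio from the source ("?" tolerates silent inputs)
        "-c", "copy",    # stream copy: no re-encoding, no quality loss
        output_path,
    ]
```

Because the segments were cut on forced keyframes, stream-copy concatenation should produce seamless joins.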
## Project Structure

```text
samyolo_on_segments/
├── README.md              # This documentation
├── config.yaml            # Default configuration
├── main.py                # Main entry point
├── requirements.txt       # Python dependencies
├── spec.md                # Detailed specification
├── core/                  # Core processing modules
│   ├── __init__.py
│   ├── config_loader.py   # Configuration management
│   ├── sam2_processor.py  # SAM2 segmentation (planned)
│   ├── video_splitter.py  # Video segmentation
│   └── yolo_detector.py   # YOLO human detection
└── utils/                 # Utility modules
    ├── __init__.py
    ├── file_utils.py      # File operations
    ├── logging_utils.py   # Logging configuration
    └── status_utils.py    # Progress monitoring
```
## Usage Examples

### Basic Processing

```bash
python main.py --config config.yaml
```

### Custom Configuration

```bash
python main.py --config my_custom_config.yaml --log-file processing.log
```

### Process Specific Segments Only

```yaml
processing:
  detect_segments: [0, 5, 10, 15]  # Only process these segments
```

### High-Quality Processing

```yaml
processing:
  inference_scale: 1.0  # Full-resolution inference

video:
  output_bitrate: "100M"  # Higher bitrate
```
## Performance Considerations

### Hardware Requirements
- GPU: NVIDIA GPU with 8GB+ VRAM (recommended)
- RAM: 16GB+ for processing large videos
- Storage: SSD recommended for temporary files
### Processing Time
- Approximately 1-2x real-time on modern GPUs
- Scales with video resolution and number of segments
- YOLO detection: ~1-2 seconds per segment
- SAM2 processing: ~10-30 seconds per segment (estimated)
### Optimization Tips

- Use `inference_scale: 0.5` for faster processing
- Process only key segments with a `detect_segments` list
- Enable NVENC for hardware-accelerated encoding
- Use SSD storage for temporary files
## Troubleshooting

### Common Issues

**ImportError: No module named 'sam2'**

```bash
pip install git+https://github.com/facebookresearch/sam2.git
```

**CUDA out of memory**

- Reduce `inference_scale` to 0.25 or 0.5
- Process fewer segments at once
- Use a smaller YOLO model (`yolov8n.pt` instead of `yolov8x.pt`)
**FFmpeg not found**

```bash
# Ubuntu/Debian
sudo apt install ffmpeg

# macOS
brew install ffmpeg

# Windows
# Download from https://ffmpeg.org/download.html
```

**No humans detected**

- Lower the `yolo_confidence` threshold
- Check that humans are clearly visible in the video
- Verify that the input video format is supported
### Debug Mode

Enable detailed logging:

```yaml
advanced:
  log_level: "DEBUG"
```
## Current Status

**Implemented:**
- ✅ Video segmentation with FFmpeg
- ✅ YOLO human detection
- ✅ Configuration management
- ✅ Progress monitoring
- ✅ Segment cleanup utilities
**In Development:**
- 🚧 SAM2 integration and mask generation
- 🚧 Green screen processing
- 🚧 Video assembly with audio
**Planned:**
- 📋 Multi-object tracking
- 📋 Real-time processing support
- 📋 Web interface
- 📋 Cloud processing integration
## Contributing
This project is under active development. The core detection pipeline is functional, with SAM2 integration and green screen processing coming soon.
## License
[Add your license information here]
## Support

For issues and questions:

- Check the troubleshooting section above
- Review the logs with `log_level: "DEBUG"`
- Open an issue with your configuration and error details