initial commit

2025-07-27 11:43:07 -07:00
commit b20a8281e5
13 changed files with 1877 additions and 0 deletions
--- a/spec.md
+++ b/spec.md
@@ -0,0 +1,192 @@
+# YOLO + SAM2 Video Processing Pipeline
+
+## Overview
+
+This project provides an automated video processing pipeline that uses YOLO for human detection and SAM2 for precise segmentation to create green screen videos. The system processes long videos by splitting them into manageable segments, detecting and tracking humans in each segment, and then reassembling the processed segments into a final output video with preserved audio.
+
+## Core Functionality
+
+### Input
+- **Long video file** (MP4, AVI, MOV, or any other OpenCV-supported format; any duration)
+- **Configuration file** (YAML format) specifying processing parameters
+
+### Output
+- **Processed video file** with humans visible and background replaced with green screen
+- **Preserved audio** from the original input video
+- **Intermediate files** for debugging and quality control
+
+## Processing Pipeline
+
+### 1. Video Segmentation
+- Splits input video into configurable-duration segments (default: 5 seconds)
+- Creates organized directory structure: `segment_0/`, `segment_1/`, etc.
+- Each segment folder contains the segment video file
+- Forces keyframes so that segments encode consistently
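The segmentation step above can be sketched roughly as follows. This is a minimal illustration, not the code in `core/video_splitter.py`: the helper names `plan_segments` and `segment_command`, the file name inside each `segment_N/` folder, and the choice of a per-segment `ffmpeg -ss/-t` call are all assumptions. Keyframe forcing is left out of the sketch.

```python
import math
from pathlib import Path
from typing import List, Tuple


def plan_segments(total_duration: float,
                  segment_duration: float = 5.0) -> List[Tuple[int, float, float]]:
    """Return (index, start, length) tuples that cover the whole video."""
    if total_duration <= 0 or segment_duration <= 0:
        raise ValueError("durations must be positive")
    count = math.ceil(total_duration / segment_duration)
    plan = []
    for index in range(count):
        start = index * segment_duration
        # The last segment may be shorter than segment_duration.
        length = min(segment_duration, total_duration - start)
        plan.append((index, start, length))
    return plan


def segment_command(input_path: str, output_root: str, index: int,
                    start: float, length: float) -> List[str]:
    """Build one possible ffmpeg call that cuts a single segment."""
    # Hypothetical layout: <output_root>/segment_<index>/segment_<index>.mp4
    out_file = Path(output_root) / f"segment_{index}" / f"segment_{index}.mp4"
    return [
        "ffmpeg", "-y",
        "-ss", f"{start:.3f}",   # seek to the segment start
        "-i", input_path,
        "-t", f"{length:.3f}",   # keep only this segment's duration
        str(out_file),
    ]
```

A 12-second input with the default 5-second duration yields three segments, the last one 2 seconds long. The command list is only built here, not executed, so the plan can be inspected before any encoding starts.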
+
+### 2. Human Detection & Tracking
+- **YOLO Detection**: Automatically detects humans in keyframe segments using YOLOv8
+- **SAM2 Segmentation**: Uses detected bounding boxes as prompts for precise mask generation
+- **Mask Propagation**: Propagates masks across all frames in each segment
+- **Stereo Video Support**: Handles VR/stereo content with left/right human assignment
+- **Continuity**: Non-keyframe segments use previous segment masks for consistency
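One way to read the keyframe/non-keyframe split and the left/right assignment above is sketched below. The function names are hypothetical, and two assumptions are baked in: segment 0 always runs detection because it has no previous masks to reuse, and stereo input is laid out side by side so a detection can be assigned to an eye by the horizontal centre of its box. The spec does not confirm either.

```python
from typing import Dict, List, Sequence, Tuple, Union

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2) in pixels


def prompt_source(index: int, detect_segments: Union[str, Sequence[int]]) -> str:
    """Return "yolo" for keyframe segments, otherwise "previous_masks"."""
    if isinstance(detect_segments, str):
        if detect_segments != "all":
            raise ValueError('detect_segments must be "all" or a list of indices')
        return "yolo"
    # Assumption: the first segment has nothing to fall back on.
    if index == 0 or index in detect_segments:
        return "yolo"
    return "previous_masks"


def assign_eyes(boxes: List[Box], frame_width: int) -> Dict[str, List[Box]]:
    """Split detections of a side-by-side stereo frame by box centre."""
    half = frame_width / 2
    eyes: Dict[str, List[Box]] = {"left": [], "right": []}
    for box in boxes:
        centre_x = (box[0] + box[2]) / 2
        eyes["left" if centre_x < half else "right"].append(box)
    return eyes
```

With `detect_segments: [0, 5, 10]`, segment 3 would reuse the masks from segment 2, while segment 5 would get fresh YOLO boxes as SAM2 prompts.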
+
+### 3. Green Screen Processing
+- **Mask Application**: Applies generated masks to isolate humans
+- **Background Replacement**: Replaces non-human areas with green screen (RGB: 0,255,0)
+- **GPU Acceleration**: Uses CuPy for fast mask processing
+- **Multi-resolution**: Low-res inference for speed, full-res final rendering
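The mask application and multi-resolution bullets above amount to a small array operation. The sketch below uses NumPy on the CPU so it runs anywhere; CuPy exposes NumPy-compatible `where` and `repeat`, so swapping the import is the likely GPU path, though that is untested here. The function names are illustrative, and the integer upscale factor is an assumption (an `inference_scale` of 0.5 would correspond to a factor of 2).

```python
import numpy as np

# Pure green; the value is the same in RGB and in OpenCV's BGR order.
GREEN = np.array([0, 255, 0], dtype=np.uint8)


def apply_green_screen(frame: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Keep masked pixels of an HxWx3 frame and paint the rest green."""
    if frame.shape[:2] != mask.shape:
        raise ValueError("mask must match the frame's height and width")
    keep = mask.astype(bool)[..., None]  # HxWx1, broadcasts over the channels
    return np.where(keep, frame, GREEN).astype(np.uint8)


def upscale_mask(mask: np.ndarray, factor: int) -> np.ndarray:
    """Nearest-neighbour upscale of a low-res mask by an integer factor."""
    return np.repeat(np.repeat(mask, factor, axis=0), factor, axis=1)
```

A mask predicted at reduced resolution would first go through `upscale_mask` and then be applied to the full-resolution frame, which is what keeps inference cheap without lowering output quality.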
+
+### 4. Video Assembly
+- **Segment Concatenation**: Combines all processed segments into single video
+- **Audio Preservation**: Copies original audio track to final output
+- **Quality Maintenance**: Preserves original video quality and framerate
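Assembly as described above could be done with FFmpeg's concat demuxer, taking video from the processed segments and audio from the original file. This is a sketch under that assumption; the helper names are hypothetical, and paths containing single quotes would need escaping that is not shown.

```python
import re
from typing import List


def natural_key(name: str) -> list:
    """Sort key that orders segment_2 before segment_10."""
    return [int(part) if part.isdigit() else part
            for part in re.split(r"(\d+)", name)]


def concat_list(segment_files: List[str]) -> str:
    """Return the text of an ffmpeg concat-demuxer list file."""
    ordered = sorted(segment_files, key=natural_key)
    return "".join(f"file '{path}'\n" for path in ordered)


def assemble_command(list_path: str, original_path: str,
                     output_path: str) -> List[str]:
    """Build an ffmpeg call that joins segments and copies the original audio."""
    return [
        "ffmpeg", "-y",
        "-f", "concat", "-safe", "0", "-i", list_path,  # input 0: processed video
        "-i", original_path,                            # input 1: audio source
        "-map", "0:v:0",
        "-map", "1:a:0?",                               # '?' tolerates a missing audio track
        "-c", "copy",                                   # no re-encode of either stream
        output_path,
    ]
```

Natural sorting matters because a plain string sort would place `segment_10` before `segment_2` and scramble the output order.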
+
+## Key Features
+
+### Automated Processing
+- **No Manual Intervention**: Fully automated human detection eliminates manual point selection
+- **Batch Processing**: Processes multiple segments efficiently
+- **Smart Fallback**: Falls back to the previous segment's masks when a segment has no fresh detections
+
+### Modular Architecture
+- **Configuration-Driven**: YAML-based configuration for easy parameter adjustment
+- **Extensible Design**: Modular structure allows for easy feature additions
+- **Error Recovery**: Graceful handling of detection failures and missing segments
+
+### Performance Optimizations
+- **GPU Acceleration**: CUDA/NVENC support for faster processing
+- **Memory Management**: Efficient handling of large videos through segmentation
+- **Concurrent Processing**: Thread-safe operations where applicable
+
+## Technical Stack
+
+### Core Dependencies
+- **SAM2**: Meta's Segment Anything Model 2 for precise segmentation
+- **YOLOv8 (Ultralytics)**: Human detection and bounding box generation
+- **OpenCV**: Video processing and frame manipulation
+- **CuPy**: GPU-accelerated array operations
+- **FFmpeg**: Video encoding/decoding and audio handling
+- **PyTorch**: Deep learning framework backend
+
+### Supported Formats
+- **Input Video**: MP4, AVI, MOV (any OpenCV-supported format)
+- **Output Video**: MP4 with H.265/HEVC encoding
+- **Audio**: Preserves original audio codec and quality
+
+## Configuration Options
+
+### Video Processing
+- `segment_duration`: Duration of each video segment (seconds)
+- `inference_scale`: Scale factor for SAM2 inference (for speed)
+- `output_scale`: Scale factor for final output
+
+### Detection Parameters
+- `yolo_model`: Path to YOLO model weights
+- `yolo_confidence`: Detection confidence threshold
+- `detect_segments`: Which segments to run YOLO detection on
+
+### SAM2 Parameters
+- `sam2_checkpoint`: Path to SAM2 model weights
+- `sam2_config`: SAM2 model configuration file
+
+### Output Options
+- `use_nvenc`: Enable NVIDIA hardware encoding
+- `output_bitrate`: Video bitrate for final output
+- `preserve_audio`: Whether to copy audio track
+
+## Directory Structure
+
+```
+new_yolo/
+├── spec.md                 # This specification document
+├── requirements.txt        # Python dependencies
+├── config.yaml             # Default configuration file
+├── main.py                 # Entry point script
+├── core/
+│   ├── __init__.py
+│   ├── video_splitter.py   # Video segmentation logic
+│   ├── yolo_detector.py    # YOLO human detection
+│   ├── sam2_processor.py   # SAM2 segmentation
+│   ├── mask_processor.py   # Mask application and green screen
+│   ├── video_assembler.py  # Final video assembly
+│   └── config_loader.py    # Configuration management
+├── utils/
+│   ├── __init__.py
+│   ├── file_utils.py       # File system operations
+│   ├── video_utils.py      # Video processing utilities
+│   └── logging_utils.py    # Logging configuration
+└── examples/
+    ├── basic_config.yaml   # Example configuration
+    └── advanced_config.yaml # Advanced configuration options
+```
+
+## Usage Examples
+
+### Basic Usage
+```bash
+python main.py --config config.yaml
+```
+
+### Custom Configuration
+```bash
+python main.py --config examples/advanced_config.yaml
+```
+
+### Configuration File Example
+```yaml
+input:
+  video_path: "/path/to/input/video.mp4"
+  
+output:
+  directory: "/path/to/output/"
+  filename: "processed_video.mp4"
+  
+processing:
+  segment_duration: 5
+  inference_scale: 0.5
+  yolo_confidence: 0.6
+  detect_segments: "all"  # or [0, 5, 10]
+  
+models:
+  yolo_model: "yolov8n.pt"
+  sam2_checkpoint: "../checkpoints/sam2.1_hiera_large.pt"
+  sam2_config: "configs/sam2.1/sam2.1_hiera_l.yaml"
+```
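Validating the `processing:` block right after loading catches bad values before any model is started. The sketch below is an assumption about how that could look, not the contents of `core/config_loader.py`. `ProcessingConfig` and `processing_from_dict` are hypothetical names, the defaults other than `segment_duration` are simply copied from the example file, and the accepted ranges are guesses. Parsing the YAML itself (for instance with PyYAML's `yaml.safe_load`) is left out so the block needs only the standard library.

```python
from dataclasses import dataclass, fields
from typing import List, Union


@dataclass
class ProcessingConfig:
    segment_duration: float = 5.0
    inference_scale: float = 0.5
    yolo_confidence: float = 0.6
    detect_segments: Union[str, List[int]] = "all"

    def __post_init__(self):
        if self.segment_duration <= 0:
            raise ValueError("segment_duration must be positive")
        # Assumed range: inference never runs above native resolution.
        if not 0 < self.inference_scale <= 1:
            raise ValueError("inference_scale must be in (0, 1]")
        if not 0 <= self.yolo_confidence <= 1:
            raise ValueError("yolo_confidence must be in [0, 1]")
        if self.detect_segments != "all":
            if isinstance(self.detect_segments, str) or any(
                not isinstance(i, int) or i < 0 for i in self.detect_segments
            ):
                raise ValueError('detect_segments must be "all" or a list of indices')


def processing_from_dict(raw: dict) -> ProcessingConfig:
    """Build the config from the parsed `processing:` mapping."""
    known = {f.name for f in fields(ProcessingConfig)}
    unknown = set(raw) - known
    if unknown:
        # Reject typos such as `segment_duraton` instead of ignoring them.
        raise ValueError(f"unknown processing keys: {sorted(unknown)}")
    return ProcessingConfig(**raw)
```

Rejecting unknown keys is a deliberate choice here: a misspelled option would otherwise fall back to its default without any warning.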
+
+## Use Cases
+
+### Content Creation
+- **VR/360 Video Processing**: Remove backgrounds from immersive content
+- **Green Screen Production**: Automated background removal for video production
+- **Social Media Content**: Quick background replacement for content creators
+
+### Commercial Applications
+- **Video Conferencing**: Background replacement for recorded calls (live use depends on the planned real-time support)
+- **E-learning**: Professional video production with clean backgrounds
+- **Marketing**: Product demonstration videos with custom backgrounds
+
+## Performance Considerations
+
+### Hardware Requirements
+- **GPU**: NVIDIA GPU with CUDA support (recommended)
+- **RAM**: 16GB+ for processing large videos
+- **Storage**: SSD recommended for temporary file operations
+
+### Processing Time
+- Approximately **1-2x real-time** on modern GPUs
+- Scales with video resolution and segment count
+- Memory usage is bounded by segment size, so it does not grow with input video length
+
+## Future Enhancements
+
+### Planned Features
+- **Multi-object Tracking**: Support for multiple humans per frame
+- **Custom Object Detection**: Configurable object classes beyond humans
+- **Real-time Processing**: Live video stream support
+- **Cloud Integration**: AWS/GCP processing support
+- **Web Interface**: Browser-based configuration and monitoring
+
+### Model Improvements
+- **Fine-tuned YOLO**: Domain-specific human detection models
+- **SAM2 Optimization**: Custom SAM2 checkpoints for video content
+- **Temporal Consistency**: Enhanced cross-segment mask propagation