# YOLO + SAM2 VR180 Video Processing Pipeline - LLM Guide

## Project Overview

This repository implements an automated video processing pipeline specifically designed for **VR180 side-by-side stereo videos**. The system detects and segments humans in video content, replacing backgrounds with green screen for post-production compositing. The pipeline is optimized for long VR videos by splitting them into manageable segments, processing each segment independently, and then reassembling the final output.

## Core Purpose

The primary goal is to automatically create green screen videos from VR180 content where:

- **Left eye view** (left half of frame) contains humans as Object 1 (green masks)
- **Right eye view** (right half of frame) contains humans as Object 2 (blue masks)
- Background is replaced with pure green (RGB: 0, 255, 0) for chroma keying
- Original audio is preserved throughout the process
- Processing handles videos of any length through segmentation
## Architecture Overview

### Pipeline Stages

1. **Video Segmentation** (`core/video_splitter.py`)
   - Splits long videos into 5-second segments using FFmpeg
   - Creates organized directory structure: `segment_0/`, `segment_1/`, etc.
   - Preserves timestamps and forces keyframes for clean cuts

2. **Human Detection** (`core/yolo_detector.py`)
   - Uses YOLOv8 for robust human detection in VR180 format
   - Supports both detection mode (bounding boxes) and segmentation mode (direct masks)
   - Automatically assigns humans to left/right eye based on position in frame
   - Saves detection results for reuse and debugging

3. **Mask Generation** (`core/sam2_processor.py`)
   - Uses Meta's SAM2 (Segment Anything Model 2) for precise segmentation
   - Propagates masks across all frames in each segment
   - Supports mask continuity between segments using previous segment's final masks
   - Handles VR180 stereo tracking with separate object IDs for each eye

4. **Green Screen Processing** (`core/mask_processor.py`)
   - Applies generated masks to isolate humans
   - Replaces background with green screen
   - Uses GPU acceleration (CuPy) for fast processing
   - Maintains original video quality and framerate

5. **Video Assembly** (`core/video_assembler.py`)
   - Concatenates all processed segments into final video
   - Preserves original audio track from input video
   - Uses hardware encoding (NVENC) when available
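
The segmentation stage can be sketched as an FFmpeg invocation. The helper name `build_split_command` and the flat `segment_%d.mp4` output pattern below are illustrative assumptions, not the actual `video_splitter.py` interface (which writes per-segment directories):

```python
import subprocess  # only needed if you actually run the command

def build_split_command(input_path: str, out_pattern: str, seg_seconds: int = 5) -> list[str]:
    """Build an FFmpeg command that splits a video into fixed-length segments.

    Forces a keyframe at each segment boundary so every cut is clean,
    and resets timestamps so each segment starts at t=0.
    """
    return [
        "ffmpeg", "-i", input_path,
        # Force keyframes exactly at segment boundaries for clean cuts.
        "-force_key_frames", f"expr:gte(t,n_forced*{seg_seconds})",
        "-f", "segment",
        "-segment_time", str(seg_seconds),
        "-reset_timestamps", "1",
        "-c:v", "libx264",  # re-encode so the forced keyframes take effect
        out_pattern,
    ]

cmd = build_split_command("/path/to/vr180_video.mp4", "segment_%d.mp4")
# subprocess.run(cmd, check=True)  # uncomment to actually split
```

Re-encoding (rather than `-c copy`) is what allows keyframes to be forced at exact boundaries.
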
### Key Components

```
samyolo_on_segments/
├── main.py                  # Entry point - orchestrates the pipeline
├── config.yaml              # Configuration file (YAML format)
├── core/                    # Core processing modules
│   ├── config_loader.py     # Configuration management
│   ├── video_splitter.py    # FFmpeg-based video segmentation
│   ├── yolo_detector.py     # YOLO human detection (detection/segmentation modes)
│   ├── sam2_processor.py    # SAM2 mask generation and propagation
│   ├── mask_processor.py    # Green screen application
│   └── video_assembler.py   # Final video concatenation
├── utils/                   # Utility functions
│   ├── file_utils.py        # File system operations
│   ├── logging_utils.py     # Logging configuration
│   └── status_utils.py      # Progress monitoring
└── models/                  # Model storage (created by download_models.py)
    ├── sam2/                # SAM2 checkpoints and configs
    └── yolo/                # YOLO model weights
```

## VR180 Specific Features

### Stereo Video Handling

- Automatically detects humans in left and right eye views
- Assigns Object ID 1 to left eye humans (green masks)
- Assigns Object ID 2 to right eye humans (blue masks)
- Maintains stereo correspondence throughout segments

### Frame Division Logic

- Frame width is divided in half to separate left/right views
- Human detection centers are used to determine eye assignment
- If only one human is detected, it may be duplicated to both eyes (configurable)
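
The eye-assignment rule above can be sketched as a small helper. The function name and return convention are illustrative, not the actual `yolo_detector.py` API:

```python
def assign_eye(bbox, frame_width):
    """Assign a detection to the left or right eye view of a VR180 frame.

    bbox is (x1, y1, x2, y2) in pixels. Returns the SAM2 object ID:
    1 for the left eye (green mask), 2 for the right eye (blue mask).
    """
    x1, _, x2, _ = bbox
    center_x = (x1 + x2) / 2
    # The frame is split in half: left half = left eye, right half = right eye.
    return 1 if center_x < frame_width / 2 else 2

# A person centred at x=900 in a 3840-wide frame falls in the left eye view;
# the stereo copy half a frame to the right falls in the right eye view.
assert assign_eye((800, 200, 1000, 900), 3840) == 1
assert assign_eye((2720, 200, 2920, 900), 3840) == 2
```
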
## Configuration System

The pipeline is controlled via `config.yaml` with these key sections:

### Essential Settings

```yaml
input:
  video_path: "/path/to/vr180_video.mp4"

output:
  directory: "/path/to/output/"
  filename: "greenscreen_output.mp4"

processing:
  segment_duration: 5      # Seconds per segment
  inference_scale: 0.5     # Scale for faster processing
  yolo_confidence: 0.6     # Detection threshold
  detect_segments: "all"   # Which segments to process

models:
  yolo_model: "models/yolo/yolov8n.pt"
  sam2_checkpoint: "models/sam2/checkpoints/sam2.1_hiera_large.pt"
  sam2_config: "models/sam2/configs/sam2.1/sam2.1_hiera_l.yaml"
```
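
A minimal sketch of how a loader might layer user settings over built-in defaults; the real `ConfigLoader` interface is not shown in this guide, so the function name and the `DEFAULTS` values below are assumptions taken from the sample config:

```python
DEFAULTS = {
    "processing": {
        "segment_duration": 5,
        "inference_scale": 0.5,
        "yolo_confidence": 0.6,
        "detect_segments": "all",
    },
}

def merge_config(defaults: dict, user: dict) -> dict:
    """Recursively overlay user-supplied values on top of the defaults."""
    merged = dict(defaults)
    for key, value in user.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_config(merged[key], value)
        else:
            merged[key] = value
    return merged

# A config.yaml that only overrides the confidence keeps all other defaults.
cfg = merge_config(DEFAULTS, {"processing": {"yolo_confidence": 0.4}})
```
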
### Advanced Options

- **YOLO Modes**: Switch between detection (bboxes) and segmentation (direct masks)
- **Mid-segment Detection**: Re-detect humans at intervals within segments
- **Mask Quality**: Temporal smoothing, morphological operations, edge refinement
- **Debug Outputs**: Save detection visualizations and first-frame masks
## Processing Flow

### For First Segment (segment_0):

1. Load first frame at inference scale
2. Run YOLO to detect humans
3. Convert detections to SAM2 prompts (or use YOLO masks directly)
4. Initialize SAM2 with prompts/masks
5. Propagate masks through all frames
6. Apply green screen and save output
7. Save final mask for next segment

### For Subsequent Segments:

1. Check if YOLO detection is requested for this segment
2. If yes: run YOLO detection (same as first segment)
3. If no: load previous segment's final mask
4. Initialize SAM2 with previous masks
5. Continue propagation through segment
6. Apply green screen and save output

### Fallback Logic:

- If no previous mask exists, the pipeline searches backwards through earlier segments
- First segment always requires YOLO detection
- Missing detections can be recovered in later segments
## Model Support

### YOLO Models

- **Detection**: yolov8n.pt, yolov8s.pt, yolov8m.pt (bounding boxes only)
- **Segmentation**: yolov8n-seg.pt, yolov8s-seg.pt (direct mask output)

### SAM2 Models

- **Tiny**: sam2.1_hiera_tiny.pt (fastest, lowest quality)
- **Small**: sam2.1_hiera_small.pt
- **Base+**: sam2.1_hiera_base_plus.pt
- **Large**: sam2.1_hiera_large.pt (best quality, slowest)
## Key Implementation Details

### GPU Optimization

- CUDA device selection with MPS fallback
- CuPy for GPU-accelerated mask operations
- NVENC hardware encoding support
- Batch processing where possible
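
The green screen application itself is a simple array operation. Here is a NumPy sketch; the real `mask_processor.py` performs the equivalent on the GPU with CuPy (whose API mirrors NumPy), and the function name below is an assumption:

```python
import numpy as np

GREEN = np.array([0, 255, 0], dtype=np.uint8)  # pure green for chroma keying

def apply_green_screen(frame: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Keep human pixels where the mask is True; paint everything else green.

    frame: (H, W, 3) uint8 image; mask: (H, W) boolean human mask.
    """
    # mask[..., None] broadcasts the (H, W) mask across the 3 color channels.
    return np.where(mask[..., None], frame, GREEN).astype(np.uint8)
```

Swapping `numpy` for `cupy` keeps the same code path while moving the work to the GPU, which is the kind of interchange the CuPy-based optimization relies on.
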
### Memory Management

- Segments processed sequentially to limit memory usage
- Explicit garbage collection between segments
- Low-resolution inference with high-resolution rendering
- Configurable scale factors for different stages
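
Low-resolution inference with high-resolution rendering means coordinates found at `inference_scale` must be mapped back to full resolution before rendering. A sketch of the bounding-box case (the helper name is illustrative):

```python
def scale_bbox(bbox, inference_scale):
    """Map a bbox detected on a downscaled frame back to full resolution.

    bbox is (x1, y1, x2, y2) in inference-frame pixels; dividing by the
    scale factor recovers full-resolution coordinates.
    """
    return tuple(round(c / inference_scale) for c in bbox)

# At inference_scale 0.5, coordinates double when mapped back.
assert scale_bbox((100, 50, 200, 150), 0.5) == (200, 100, 400, 300)
```
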
### Error Handling

- Graceful fallback when masks are unavailable
- Segment-level recovery (can restart individual segments)
- Comprehensive logging at all stages
- Status checking and cleanup utilities
## Debugging Features

### Status Monitoring

```bash
python main.py --config config.yaml --status
```

### Segment Cleanup

```bash
python main.py --config config.yaml --cleanup-segment 5
```

### Debug Outputs

- `yolo_debug.jpg`: Bounding box visualizations
- `first_frame_detection.jpg`: Initial mask visualization
- `mask.png`: Final segment mask for continuity
- `yolo_detections`: Saved detection coordinates
## Common Issues and Solutions

### No Right Eye Detections in VR180

- Lower the `yolo_confidence` threshold (try 0.3-0.4)
- Enable debug mode to analyze detection confidence
- Check whether the person is actually visible in the right eye view

### Mask Propagation Failures

- Ensure the first segment has successful YOLO detections
- Check that the previous segment's `mask.png` exists
- Consider re-running YOLO on problem segments

### Memory Issues

- Reduce `inference_scale` (try 0.25)
- Use smaller models (tiny/small variants)
- Process fewer segments at once
## Development Notes

### Adding Features

- All core modules inherit from base classes in `core/`
- Configuration is centralized through `ConfigLoader`
- Logging uses Python's standard logging module
- File operations go through `utils/file_utils.py`

### Testing Components

- Each module can be tested independently
- Use the `--status` flag to check processing state
- Debug outputs help verify each stage

### Performance Tuning

- Adjust `inference_scale` for the speed vs. quality trade-off
- Use `detect_segments` to run YOLO detection only on selected segments
- Enable `use_nvenc` for hardware encoding
- Consider `vos_optimized` mode for SAM2 (experimental)
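
The per-segment detection decision driven by `detect_segments` might be implemented as below. Only the `"all"` value appears in the sample config; the list-of-indices form is an assumption for illustration, not documented behaviour:

```python
def detection_requested(detect_segments, segment_index):
    """Decide whether to run fresh YOLO detection on this segment.

    "all" means every segment; a list of indices (an assumed format)
    means only those segments. Segment 0 always needs detection,
    since there is no earlier mask to propagate from.
    """
    if segment_index == 0 or detect_segments == "all":
        return True
    return segment_index in detect_segments

assert detection_requested("all", 7)
assert detection_requested([3, 6], 6)
assert not detection_requested([3, 6], 5)
assert detection_requested([3, 6], 0)  # first segment always detects
```
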
## Original Monolithic Script

The project includes the original working script in `spec.md` (lines 200-811) as a reference implementation. This script works but processes videos monolithically. The current modular architecture maintains the same core logic while adding:

- Better error handling and recovery
- Configurable processing pipeline
- Debug and monitoring capabilities
- Cleaner code organization