Compare commits

3 Commits

Author SHA1 Message Date
b97a3752a7 stereo mask working 2025-07-31 11:13:31 -07:00
0057017ac4 working a bit faster 2025-07-31 09:09:22 -07:00
70044e1b10 sbs working phase 1 2025-07-30 18:07:26 -07:00
14 changed files with 4119 additions and 220 deletions

230
claude.md Normal file

@@ -0,0 +1,230 @@
# YOLO + SAM2 VR180 Video Processing Pipeline - LLM Guide
## Project Overview
This repository implements an automated video processing pipeline specifically designed for **VR180 side-by-side stereo videos**. The system detects and segments humans in video content, replacing backgrounds with green screen for post-production compositing. The pipeline is optimized for long VR videos by splitting them into manageable segments, processing each segment independently, and then reassembling the final output.
## Core Purpose
The primary goal is to automatically create green screen videos from VR180 content where:
- **Left eye view** (left half of frame) contains humans as Object 1 (green masks)
- **Right eye view** (right half of frame) contains humans as Object 2 (blue masks)
- Background is replaced with pure green (RGB: 0,255,0) for chroma keying
- Original audio is preserved throughout the process
- Processing handles videos of any length through segmentation
## Architecture Overview
### Pipeline Stages
1. **Video Segmentation** (`core/video_splitter.py`)
- Splits long videos into 5-second segments using FFmpeg
- Creates organized directory structure: `segment_0/`, `segment_1/`, etc.
- Preserves timestamps and forces keyframes for clean cuts
2. **Human Detection** (`core/yolo_detector.py`)
- Uses YOLOv8 for robust human detection in VR180 format
- Supports both detection mode (bounding boxes) and segmentation mode (direct masks)
- Automatically assigns humans to left/right eye based on position in frame
- Saves detection results for reuse and debugging
3. **Mask Generation** (`core/sam2_processor.py`)
- Uses Meta's SAM2 (Segment Anything Model 2) for precise segmentation
- Propagates masks across all frames in each segment
- Supports mask continuity between segments using previous segment's final masks
- Handles VR180 stereo tracking with separate object IDs for each eye
4. **Green Screen Processing** (`core/mask_processor.py`)
- Applies generated masks to isolate humans
- Replaces background with green screen
- Uses GPU acceleration (CuPy) for fast processing
- Maintains original video quality and framerate
5. **Video Assembly** (`core/video_assembler.py`)
- Concatenates all processed segments into final video
- Preserves original audio track from input video
- Uses hardware encoding (NVENC) when available
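
As a rough illustration of the segmentation stage, the sketch below builds an FFmpeg command that cuts a video into fixed-length pieces with a forced keyframe at every boundary. The helper name `build_split_command` and the flat `segment_%d.mp4` output pattern are assumptions for this example; the actual `core/video_splitter.py` is not shown here and may build its command differently (for instance, writing into `segment_N/` directories).

```python
from typing import List


def build_split_command(input_path: str, output_pattern: str,
                        segment_seconds: int = 5) -> List[str]:
    """Build an FFmpeg command that cuts a video into fixed-length segments.

    Hypothetical helper for illustration only.
    """
    return [
        "ffmpeg", "-y",
        "-i", input_path,
        # Force a keyframe at each multiple of segment_seconds so that
        # every segment can start on a clean cut.
        "-force_key_frames", f"expr:gte(t,n_forced*{segment_seconds})",
        # Use the segment muxer with the requested segment length.
        "-f", "segment",
        "-segment_time", str(segment_seconds),
        output_pattern,
    ]


cmd = build_split_command("input.mp4", "segment_%d.mp4")
print(" ".join(cmd))
```

The command is only constructed and printed here; passing it to `subprocess.run` would perform the split.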
### Key Components
```
samyolo_on_segments/
├── main.py # Entry point - orchestrates the pipeline
├── config.yaml # Configuration file (YAML format)
├── core/ # Core processing modules
│ ├── config_loader.py # Configuration management
│ ├── video_splitter.py # FFmpeg-based video segmentation
│ ├── yolo_detector.py # YOLO human detection (detection/segmentation modes)
│ ├── sam2_processor.py # SAM2 mask generation and propagation
│ ├── mask_processor.py # Green screen application
│ └── video_assembler.py # Final video concatenation
├── utils/ # Utility functions
│ ├── file_utils.py # File system operations
│ ├── logging_utils.py # Logging configuration
│ └── status_utils.py # Progress monitoring
└── models/ # Model storage (created by download_models.py)
├── sam2/ # SAM2 checkpoints and configs
└── yolo/ # YOLO model weights
```
## VR180 Specific Features
### Stereo Video Handling
- Automatically detects humans in left and right eye views
- Assigns Object ID 1 to left eye humans (green masks)
- Assigns Object ID 2 to right eye humans (blue masks)
- Maintains stereo correspondence throughout segments
### Frame Division Logic
- Frame width is divided in half to separate left/right views
- Human detection centers are used to determine eye assignment
- If only one human is detected, it may be duplicated to both eyes (configurable)
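
The frame division rule above can be sketched in a few lines. This is a minimal example assuming detections are `[x1, y1, x2, y2]` boxes in full-frame pixel coordinates; the function names `assign_eye` and `group_by_eye` are made up for illustration.

```python
from typing import Dict, List


def assign_eye(bbox: List[float], frame_width: int) -> str:
    """Return 'left' or 'right' based on the horizontal center of the box."""
    center_x = (bbox[0] + bbox[2]) / 2
    return "left" if center_x < frame_width // 2 else "right"


def group_by_eye(bboxes: List[List[float]],
                 frame_width: int) -> Dict[str, List[List[float]]]:
    """Split a list of boxes into per-eye lists."""
    groups: Dict[str, List[List[float]]] = {"left": [], "right": []}
    for bbox in bboxes:
        groups[assign_eye(bbox, frame_width)].append(bbox)
    return groups


# Two detections in a 3840-pixel-wide side-by-side frame.
boxes = [[100, 50, 300, 400], [2100, 60, 2300, 410]]
groups = group_by_eye(boxes, frame_width=3840)
print(groups)
```

A box whose center sits exactly on the midline is assigned to the right eye under this rule.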
## Configuration System
The pipeline is controlled via `config.yaml` with these key sections:
### Essential Settings
```yaml
input:
video_path: "/path/to/vr180_video.mp4"
output:
directory: "/path/to/output/"
filename: "greenscreen_output.mp4"
processing:
segment_duration: 5 # Seconds per segment
inference_scale: 0.5 # Scale for faster processing
yolo_confidence: 0.6 # Detection threshold
detect_segments: "all" # Which segments to process
models:
yolo_model: "models/yolo/yolov8n.pt"
sam2_checkpoint: "models/sam2/checkpoints/sam2.1_hiera_large.pt"
sam2_config: "models/sam2/configs/sam2.1/sam2.1_hiera_l.yaml"
```
### Advanced Options
- **YOLO Modes**: Switch between detection (bboxes) and segmentation (direct masks)
- **Mid-segment Detection**: Re-detect humans at intervals within segments
- **Mask Quality**: Temporal smoothing, morphological operations, edge refinement
- **Debug Outputs**: Save detection visualizations and first-frame masks
## Processing Flow
### For First Segment (segment_0):
1. Load first frame at inference scale
2. Run YOLO to detect humans
3. Convert detections to SAM2 prompts (or use YOLO masks directly)
4. Initialize SAM2 with prompts/masks
5. Propagate masks through all frames
6. Apply green screen and save output
7. Save final mask for next segment
### For Subsequent Segments:
1. Check if YOLO detection is requested for this segment
2. If yes: Use YOLO detection (same as first segment)
3. If no: Load previous segment's final mask
4. Initialize SAM2 with previous masks
5. Continue propagation through segment
6. Apply green screen and save output
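
The choice between fresh YOLO detection and reusing the previous mask could look roughly like the following. It is a sketch that assumes `detect_segments` is either the string `"all"` or a list of segment indices, with an empty list meaning "all"; `should_run_yolo` is a hypothetical name, not a function from the repository.

```python
from typing import List, Union


def should_run_yolo(segment_idx: int,
                    detect_segments: Union[str, List[int]]) -> bool:
    """Decide whether a segment starts from YOLO detections."""
    # The first segment has no previous mask, so it always needs detection.
    if segment_idx == 0:
        return True
    if isinstance(detect_segments, str):
        return detect_segments == "all"
    # An empty list falls back to the default of detecting everywhere.
    return len(detect_segments) == 0 or segment_idx in detect_segments


for idx in range(4):
    print(idx, should_run_yolo(idx, [0, 2]))
```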
### Fallback Logic:
- If the previous segment has no saved mask, the pipeline searches backwards through earlier segments for the most recent one
- First segment always requires YOLO detection
- Missing detections can be recovered in later segments
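
One way to picture the backward search is shown below. The sketch assumes each segment lives in `segment_<index>/` under a common output directory and stores its final mask as `mask.png`, as the debug output list suggests; `find_previous_mask` is an illustrative name only.

```python
from pathlib import Path
from typing import Optional


def find_previous_mask(output_dir: Path, segment_idx: int) -> Optional[Path]:
    """Return the newest mask saved by any segment before segment_idx."""
    # Walk from the immediately preceding segment back to segment 0.
    for idx in range(segment_idx - 1, -1, -1):
        candidate = output_dir / f"segment_{idx}" / "mask.png"
        if candidate.is_file():
            return candidate
    # No earlier segment produced a mask; the caller must fall back to YOLO.
    return None
```

Returning `None` rather than raising keeps the decision about re-running detection with the caller.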
## Model Support
### YOLO Models
- **Detection**: yolov8n.pt, yolov8s.pt, yolov8m.pt (bounding boxes only)
- **Segmentation**: yolov8n-seg.pt, yolov8s-seg.pt (direct mask output)
### SAM2 Models
- **Tiny**: sam2.1_hiera_tiny.pt (fastest, lowest quality)
- **Small**: sam2.1_hiera_small.pt
- **Base+**: sam2.1_hiera_base_plus.pt
- **Large**: sam2.1_hiera_large.pt (best quality, slowest)
## Key Implementation Details
### GPU Optimization
- CUDA device selection with MPS fallback
- CuPy for GPU-accelerated mask operations
- NVENC hardware encoding support
- Batch processing where possible
### Memory Management
- Segments processed sequentially to limit memory usage
- Explicit garbage collection between segments
- Low-resolution inference with high-resolution rendering
- Configurable scale factors for different stages
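
The idea of low-resolution inference with high-resolution rendering can be shown with plain Python lists: a mask computed on a downscaled frame is stretched back to full size with nearest-neighbour sampling and then used to composite the original frame over green. This is a toy sketch; the real pipeline is described as using CuPy on the GPU, and both function names here are invented for the example.

```python
from typing import List, Tuple

Pixel = Tuple[int, int, int]


def upscale_mask_nearest(mask: List[List[int]],
                         out_h: int, out_w: int) -> List[List[int]]:
    """Resize a binary mask with nearest-neighbour sampling."""
    in_h, in_w = len(mask), len(mask[0])
    return [
        [mask[y * in_h // out_h][x * in_w // out_w] for x in range(out_w)]
        for y in range(out_h)
    ]


def apply_green_screen(frame: List[List[Pixel]], mask: List[List[int]],
                       green: Pixel = (0, 255, 0)) -> List[List[Pixel]]:
    """Keep pixels where the mask is set and paint the rest green."""
    return [
        [px if m else green for px, m in zip(frame_row, mask_row)]
        for frame_row, mask_row in zip(frame, mask)
    ]


low_res_mask = [[1, 0]]                      # mask from half-size inference
full_mask = upscale_mask_nearest(low_res_mask, out_h=2, out_w=4)
frame = [[(9, 9, 9)] * 4 for _ in range(2)]  # full-resolution frame
result = apply_green_screen(frame, full_mask)
print(result[0])
```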
### Error Handling
- Graceful fallback when masks are unavailable
- Segment-level recovery (can restart individual segments)
- Comprehensive logging at all stages
- Status checking and cleanup utilities
## Debugging Features
### Status Monitoring
```bash
python main.py --config config.yaml --status
```
### Segment Cleanup
```bash
python main.py --config config.yaml --cleanup-segment 5
```
### Debug Outputs
- `yolo_debug.jpg`: Bounding box visualizations
- `first_frame_detection.jpg`: Initial mask visualization
- `mask.png`: Final segment mask for continuity
- `yolo_detections`: Saved detection coordinates
## Common Issues and Solutions
### No Right Eye Detections in VR180
- Lower `yolo_confidence` threshold (try 0.3-0.4)
- Enable debug mode to analyze detection confidence
- Check if person is actually visible in right eye view
### Mask Propagation Failures
- Ensure first segment has successful YOLO detections
- Check previous segment's mask.png exists
- Consider re-running YOLO on problem segments
### Memory Issues
- Reduce `inference_scale` (try 0.25)
- Use smaller models (tiny/small variants)
- Process fewer segments at once
## Development Notes
### Adding Features
- All core modules inherit from base classes in `core/`
- Configuration is centralized through `ConfigLoader`
- Logging uses Python's standard logging module
- File operations go through `utils/file_utils.py`
### Testing Components
- Each module can be tested independently
- Use `--status` flag to check processing state
- Debug outputs help verify each stage
### Performance Tuning
- Adjust `inference_scale` for speed vs quality
- Use `detect_segments` to limit YOLO detection to selected segments
- Enable `use_nvenc` for hardware encoding
- Consider the experimental `sam2_vos_optimized` option for SAM2 (per `config.yaml`, it requires PyTorch 2.5.1+)
## Original Monolithic Script
The project includes the original working script in `spec.md` (lines 200-811) as a reference implementation. This script works but processes videos monolithically. The current modular architecture maintains the same core logic while adding:
- Better error handling and recovery
- Configurable processing pipeline
- Debug and monitoring capabilities
- Cleaner code organization

@@ -1,62 +1,137 @@
# YOLO + SAM2 Video Processing Configuration
# This file serves as a complete reference for all available settings.
input:
# Full path to the input video file.
video_path: "/path/to/input/video.mp4"
output:
# Directory where all output files and segments will be stored.
directory: "/path/to/output/"
# Filename for the final assembled video.
filename: "processed_video.mp4"
processing:
# Duration of each video segment in seconds
# Duration of each video segment in seconds. Shorter segments use less memory.
segment_duration: 5
# Scale factor for SAM2 inference (0.5 = half resolution)
# Scale factor for SAM2 inference (e.g., 0.5 = half resolution).
# Lower values are faster but may reduce mask quality.
inference_scale: 0.5
# YOLO detection confidence threshold
# YOLO detection confidence threshold (0.0 to 1.0).
yolo_confidence: 0.6
# Which segments to run YOLO detection on
# Options: "all", [0, 5, 10], or [] for default (all)
# Which segments to run YOLO detection on.
# Options: "all", a list of specific segment indices (e.g., [0, 10, 20]), or [] for default ("all").
detect_segments: "all"
# --- VR180 Stereo Processing ---
# Enables special logic for VR180 SBS video. When false, video is treated as a single view.
separate_eye_processing: false
# Threshold for stereo mask agreement (Intersection over Union).
# A value of 0.5 means masks must overlap by 50% to be considered a pair.
stereo_iou_threshold: 0.5
# Factor to reduce YOLO confidence by if no stereo pairs are found on the first try (e.g., 0.8 = 20% reduction).
confidence_reduction_factor: 0.8
# If no humans are detected in a segment, create a full green screen video.
# Only used when separate_eye_processing is true.
enable_greenscreen_fallback: true
# Pixel overlap between left/right eyes for smoother blending at the center seam.
eye_overlap_pixels: 0
models:
# YOLO model path - can be pretrained (yolov8n.pt) or custom path
yolo_model: "models/yolo/yolov8n.pt"
# YOLO mode: "detection" (for bounding boxes) or "segmentation" (for direct masks).
# "segmentation" is generally recommended as it provides initial masks to SAM2.
yolo_mode: "segmentation"
# SAM2 model configuration
sam2_checkpoint: "models/sam2/checkpoints/sam2.1_hiera_large.pt"
sam2_config: "models/sam2/configs/sam2.1/sam2.1_hiera_l.yaml"
# Path to the YOLO model for "detection" mode.
yolo_detection_model: "models/yolo/yolo11l.pt"
# Path to the YOLO model for "segmentation" mode.
yolo_segmentation_model: "models/yolo/yolo11x-seg.pt"
# --- SAM2 Model Configuration ---
sam2_checkpoint: "models/sam2/checkpoints/sam2.1_hiera_small.pt"
sam2_config: "models/sam2/configs/sam2.1/sam2.1_hiera_s.yaml"
# (Experimental) Use optimized VOS predictor for a significant speedup. Requires PyTorch 2.5.1+.
sam2_vos_optimized: false
video:
# Use NVIDIA hardware encoding (requires NVENC-capable GPU)
# Use NVIDIA's NVENC for hardware-accelerated video encoding.
use_nvenc: true
# Output video bitrate
# Bitrate for the output video (e.g., "25M", "50M").
output_bitrate: "50M"
# Preserve original audio track
# If true, the audio track from the input video will be copied to the final output.
preserve_audio: true
# Force keyframes for better segment boundaries
# Force keyframes at the start of each segment for clean cuts. Recommended to keep true.
force_keyframes: true
advanced:
# Green screen color (RGB values)
# RGB color for the green screen background.
green_color: [0, 255, 0]
# Blue screen color for second object (RGB values)
# Color for the second object's mask (typically the right eye in VR180).
# Note: [255, 0, 0] is blue only in BGR channel order; read as RGB it is red.
blue_color: [255, 0, 0]
# YOLO human class ID (0 for COCO person class)
# The class ID for humans in the YOLO model (COCO default is 0 for "person").
human_class_id: 0
# GPU memory management
# If true, deletes intermediate files like segment videos after processing.
cleanup_intermediate_files: true
# Logging level (DEBUG, INFO, WARNING, ERROR)
# Logging level: DEBUG, INFO, WARNING, ERROR.
log_level: "INFO"
# Save debug frames with YOLO detections visualized
# If true, saves debug images for YOLO detections.
save_yolo_debug_frames: true
# --- Mid-Segment Re-detection ---
# Re-run YOLO at intervals within a segment to correct tracking drift.
enable_mid_segment_detection: false
redetection_interval: 30 # Frames between re-detections.
max_redetections_per_segment: 10
# --- Parallel Processing Optimizations ---
# (Experimental) Generate low-res videos for upcoming segments in the background.
enable_background_lowres_generation: false
max_concurrent_lowres: 2 # Max parallel FFmpeg processes.
lowres_segments_ahead: 2 # How many segments to prepare in advance.
use_ffmpeg_lowres: true # Use FFmpeg (faster) instead of OpenCV for low-res creation.
# --- Mask Quality Enhancement Settings ---
# These settings allow fine-tuning of the final mask appearance.
# Enabling these may increase processing time.
mask_processing:
# Edge feathering and blurring for smoother transitions.
enable_edge_blur: true
edge_blur_radius: 3
edge_blur_sigma: 0.5
# Temporal smoothing to reduce mask flickering between frames.
enable_temporal_smoothing: false
temporal_blend_weight: 0.2
temporal_history_frames: 2
# Clean up small noise and holes in the mask.
# Generally not needed when using SAM2, as its masks are high quality.
enable_morphological_cleaning: false
morphology_kernel_size: 5
min_component_size: 500
# Method for blending the mask edge with the background.
# Options: "linear" (fastest), "gaussian", "sigmoid".
alpha_blending_mode: "linear"
alpha_transition_width: 1
# Advanced edge-preserving smoothing filter. Slower but can produce higher quality edges.
enable_bilateral_filter: false
bilateral_d: 9
bilateral_sigma_color: 75
bilateral_sigma_space: 75
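
To make `stereo_iou_threshold` and `confidence_reduction_factor` concrete, here is a small sketch of both calculations. It assumes the left and right masks are binary grids of the same size and that a retry simply multiplies the confidence by the factor; how the pipeline actually pairs masks and retries is not shown in this diff, and the names `mask_iou` and `reduced_confidence` are illustrative.

```python
from typing import List


def mask_iou(a: List[List[int]], b: List[List[int]]) -> float:
    """Intersection over Union of two equally sized binary masks."""
    intersection = 0
    union = 0
    for row_a, row_b in zip(a, b):
        for pa, pb in zip(row_a, row_b):
            if pa and pb:
                intersection += 1
            if pa or pb:
                union += 1
    # Two empty masks have no overlap to measure; treat that as 0.
    return intersection / union if union else 0.0


def reduced_confidence(confidence: float, factor: float = 0.8) -> float:
    """Confidence to use on a retry, e.g. 0.8 means a 20% reduction."""
    return confidence * factor


left = [[1, 1, 0, 0]]
right = [[0, 1, 1, 0]]
iou = mask_iou(left, right)
print(f"IoU={iou:.3f}, pair={iou >= 0.5}")
print(f"retry confidence={reduced_confidence(0.6):.2f}")
```

With these sample masks the overlap is one pixel out of three, so the pair would not pass a threshold of 0.5.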

@@ -1,2 +1,4 @@
# YOLO + SAM2 Video Processing Pipeline
# Core modules for video processing with human detection and segmentation
from .eye_processor import EyeProcessor

@@ -0,0 +1,337 @@
"""
Async low-resolution video preprocessor for parallel processing optimization.
Creates low-resolution videos in background while main pipeline processes other segments.
"""
import os
import asyncio
import subprocess
import logging
import threading
from pathlib import Path
from typing import List, Dict, Any, Optional
from concurrent.futures import ThreadPoolExecutor
logger = logging.getLogger(__name__)
class AsyncLowResPreprocessor:
"""
Handles async pre-generation of low-resolution videos for SAM2 inference.
Uses FFmpeg subprocesses to bypass Python GIL limitations.
"""
def __init__(self, max_concurrent: int = 3, segments_ahead: int = 3, use_ffmpeg: bool = True):
"""
Initialize async preprocessor.
Args:
max_concurrent: Maximum number of concurrent FFmpeg processes
segments_ahead: How many segments to prepare in advance
use_ffmpeg: Use FFmpeg instead of OpenCV for better performance
"""
self.max_concurrent = max_concurrent
self.segments_ahead = segments_ahead
self.use_ffmpeg = use_ffmpeg
self.preparation_tasks = {} # segment_idx -> threading.Thread
self.completed_segments = set() # Track completed preparations
self.active_threads = [] # Track active background threads
logger.info(f"AsyncLowResPreprocessor initialized: max_concurrent={max_concurrent}, "
f"segments_ahead={segments_ahead}, use_ffmpeg={use_ffmpeg}")
async def create_lowres_ffmpeg(self, input_path: str, output_path: str, scale: float, semaphore: asyncio.Semaphore) -> bool:
"""
Create low-resolution video using FFmpeg (bypasses Python GIL).
Args:
input_path: Path to input video
output_path: Path to output low-res video
scale: Scale factor for resolution reduction
semaphore: Asyncio semaphore for limiting concurrent processes
Returns:
True if successful
"""
async with semaphore: # Limit concurrent FFmpeg processes
try:
# Ensure output directory exists
os.makedirs(os.path.dirname(output_path), exist_ok=True)
# FFmpeg command for fast low-res video creation
cmd = [
'ffmpeg', '-y', # Overwrite output
'-i', input_path,
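# Round each dimension down to an even number; libx264 with yuv420p rejects odd sizes
'-vf', f'scale=trunc(iw*{scale}/2)*2:trunc(ih*{scale}/2)*2',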
'-c:v', 'libx264',
'-preset', 'ultrafast', # Fastest encoding
'-crf', '28', # Lower quality OK for inference
'-an', # No audio needed for inference
output_path
]
logger.debug(f"Starting FFmpeg low-res creation: {os.path.basename(input_path)} -> {os.path.basename(output_path)}")
# Run FFmpeg asynchronously
proc = await asyncio.create_subprocess_exec(
*cmd,
stdout=asyncio.subprocess.DEVNULL,
stderr=asyncio.subprocess.PIPE
)
# communicate() drains stderr while waiting; awaiting wait() first can deadlock once the pipe fills
_, stderr = await proc.communicate()
if proc.returncode != 0:
stderr_text = stderr.decode(errors="replace") if stderr else "Unknown error"
logger.error(f"FFmpeg failed for {input_path}: {stderr_text}")
return False
# Verify output file was created
if not os.path.exists(output_path) or os.path.getsize(output_path) == 0:
logger.error(f"FFmpeg output file missing or empty: {output_path}")
return False
logger.debug(f"FFmpeg low-res creation completed: {os.path.basename(output_path)}")
return True
except Exception as e:
logger.error(f"Error in FFmpeg low-res creation for {input_path}: {e}")
return False
def create_lowres_opencv(self, input_path: str, output_path: str, scale: float) -> bool:
"""
Fallback: Create low-resolution video using OpenCV (blocking operation).
Used when FFmpeg is not available or fails.
Args:
input_path: Path to input video
output_path: Path to output low-res video
scale: Scale factor for resolution reduction
Returns:
True if successful
"""
try:
import cv2
logger.debug(f"Creating low-res video with OpenCV: {os.path.basename(input_path)}")
cap = cv2.VideoCapture(input_path)
if not cap.isOpened():
logger.error(f"Could not open video with OpenCV: {input_path}")
return False
# Get video properties
frame_width = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH) * scale)
frame_height = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT) * scale)
fps = cap.get(cv2.CAP_PROP_FPS) or 30.0
# Ensure output directory exists
os.makedirs(os.path.dirname(output_path), exist_ok=True)
# Create video writer
fourcc = cv2.VideoWriter_fourcc(*'mp4v')
out = cv2.VideoWriter(output_path, fourcc, fps, (frame_width, frame_height))
if not out.isOpened():
logger.error(f"Could not create video writer for: {output_path}")
cap.release()
return False
# Process frames
frame_count = 0
while True:
ret, frame = cap.read()
if not ret:
break
# Resize frame
low_res_frame = cv2.resize(frame, (frame_width, frame_height),
interpolation=cv2.INTER_LINEAR)
out.write(low_res_frame)
frame_count += 1
# Cleanup
cap.release()
out.release()
logger.debug(f"OpenCV low-res creation completed: {frame_count} frames -> {os.path.basename(output_path)}")
return True
except Exception as e:
logger.error(f"Error in OpenCV low-res creation for {input_path}: {e}")
return False
async def create_lowres_video_async(self, input_path: str, output_path: str, scale: float, semaphore: asyncio.Semaphore) -> bool:
"""
Create low-resolution video using the configured method (FFmpeg or OpenCV).
Args:
input_path: Path to input video
output_path: Path to output low-res video
scale: Scale factor for resolution reduction
semaphore: Asyncio semaphore for limiting concurrent processes
Returns:
True if successful
"""
# Skip if already exists
if os.path.exists(output_path) and os.path.getsize(output_path) > 0:
logger.debug(f"Low-res video already exists: {os.path.basename(output_path)}")
return True
if self.use_ffmpeg:
# Try FFmpeg first
success = await self.create_lowres_ffmpeg(input_path, output_path, scale, semaphore)
if success:
return True
logger.warning(f"FFmpeg failed for {input_path}, falling back to OpenCV")
# Fallback to OpenCV (run in thread pool to avoid blocking)
loop = asyncio.get_running_loop()
with ThreadPoolExecutor(max_workers=1) as executor:
success = await loop.run_in_executor(
executor, self.create_lowres_opencv, input_path, output_path, scale
)
return success
async def prepare_segment_lowres(self, segment_info: Dict[str, Any], scale: float,
separate_eye_processing: bool = False, semaphore: asyncio.Semaphore = None) -> bool:
"""
Prepare low-resolution videos for a segment (regular or eye-specific).
Args:
segment_info: Segment information dictionary
scale: Scale factor for resolution reduction
separate_eye_processing: Whether to prepare eye-specific videos
semaphore: Asyncio semaphore for limiting concurrent processes
Returns:
True if all videos were prepared successfully
"""
segment_idx = segment_info['index']
segment_dir = segment_info['directory']
try:
if separate_eye_processing:
# Prepare low-res videos for left and right eyes
success_left = success_right = True
left_eye_path = os.path.join(segment_dir, "left_eye.mp4")
right_eye_path = os.path.join(segment_dir, "right_eye.mp4")
if os.path.exists(left_eye_path):
lowres_left_path = os.path.join(segment_dir, "low_res_left_eye_video.mp4")
success_left = await self.create_lowres_video_async(left_eye_path, lowres_left_path, scale, semaphore)
if os.path.exists(right_eye_path):
lowres_right_path = os.path.join(segment_dir, "low_res_right_eye_video.mp4")
success_right = await self.create_lowres_video_async(right_eye_path, lowres_right_path, scale, semaphore)
success = success_left and success_right
if success:
logger.info(f"Pre-generated low-res eye videos for segment {segment_idx}")
else:
logger.warning(f"Failed to pre-generate some eye videos for segment {segment_idx}")
else:
# Prepare regular low-res video
input_path = segment_info['video_file']
lowres_path = os.path.join(segment_dir, "low_res_video.mp4")
success = await self.create_lowres_video_async(input_path, lowres_path, scale, semaphore)
if success:
logger.info(f"Pre-generated low-res video for segment {segment_idx}")
else:
logger.warning(f"Failed to pre-generate low-res video for segment {segment_idx}")
if success:
self.completed_segments.add(segment_idx)
return success
except Exception as e:
logger.error(f"Error preparing low-res videos for segment {segment_idx}: {e}")
return False
def start_background_preparation(self, segments_info: List[Dict[str, Any]], scale: float,
separate_eye_processing: bool = False, current_segment: int = 0):
"""
Start preparing upcoming segments in background using threads.
Args:
segments_info: List of all segment information
scale: Scale factor for resolution reduction
separate_eye_processing: Whether to prepare eye-specific videos
current_segment: Index of currently processing segment
"""
def background_worker():
"""Background thread worker that prepares upcoming segments."""
try:
# Prepare segments ahead of current processing
start_idx = current_segment + 1
end_idx = min(len(segments_info), start_idx + self.segments_ahead)
segments_to_prepare = []
for i in range(start_idx, end_idx):
if i not in self.completed_segments and i not in self.preparation_tasks:
segments_to_prepare.append((i, segments_info[i]))
if segments_to_prepare:
logger.info(f"Starting background preparation for {len(segments_to_prepare)} segments (indices {start_idx}-{end_idx-1})")
# Run async work in new event loop
loop = asyncio.new_event_loop()
asyncio.set_event_loop(loop)
try:
# Create semaphore in this event loop
semaphore = asyncio.Semaphore(self.max_concurrent)
tasks = []
for segment_idx, segment_info in segments_to_prepare:
task = self.prepare_segment_lowres(segment_info, scale, separate_eye_processing, semaphore)
tasks.append(task)
# Run all preparation tasks
results = loop.run_until_complete(asyncio.gather(*tasks, return_exceptions=True))
# Mark completed segments
for i, (segment_idx, _) in enumerate(segments_to_prepare):
if i < len(results) and results[i] is True:
self.completed_segments.add(segment_idx)
logger.debug(f"Background preparation completed for segment {segment_idx}")
finally:
loop.close()
else:
logger.debug(f"No segments need preparation (current: {current_segment})")
except Exception as e:
logger.error(f"Error in background preparation worker: {e}")
# Start background thread
thread = threading.Thread(target=background_worker, daemon=True)
thread.start()
self.active_threads.append(thread)
def is_segment_ready(self, segment_idx: int) -> bool:
"""
Check if low-res videos for a segment are ready.
Args:
segment_idx: Index of segment to check
Returns:
True if segment is ready
"""
return segment_idx in self.completed_segments
def cleanup(self):
"""Clean up any running threads."""
# Note: daemon threads will be cleaned up automatically when main process exits
# We just clear our tracking structures
self.active_threads.clear()
self.preparation_tasks.clear()
logger.debug("AsyncLowResPreprocessor cleanup completed")
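
For reference, the usual pattern for running an external tool from asyncio and capturing its error output is sketched below. It is a standalone example, not the module's code: `run_command` is a hypothetical helper, and a short Python one-liner stands in for FFmpeg so the snippet runs anywhere.

```python
import asyncio
import sys
from typing import List, Tuple


async def run_command(cmd: List[str]) -> Tuple[int, str]:
    """Run a command, discard stdout, and return (exit code, stderr text)."""
    proc = await asyncio.create_subprocess_exec(
        *cmd,
        stdout=asyncio.subprocess.DEVNULL,
        stderr=asyncio.subprocess.PIPE,
    )
    # communicate() reads the pipe until EOF and waits for the process,
    # so a chatty child cannot block on a full stderr buffer.
    _, stderr = await proc.communicate()
    return proc.returncode, stderr.decode(errors="replace")


# Stand-in for a failing FFmpeg call: writes to stderr and exits with 3.
failing = [sys.executable, "-c",
           "import sys; sys.stderr.write('boom'); sys.exit(3)"]
returncode, stderr_text = asyncio.run(run_command(failing))
print(returncode, stderr_text)
```

A single `communicate()` call is enough; there is no need to await `wait()` separately.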

@@ -184,4 +184,12 @@ class ConfigLoader:
def should_cleanup_intermediate_files(self) -> bool:
"""Get whether to cleanup intermediate files."""
return self.config.get('advanced', {}).get('cleanup_intermediate_files', True)
def get_stereo_iou_threshold(self) -> float:
"""Get the IOU threshold for stereo mask agreement."""
return self.config['processing'].get('stereo_iou_threshold', 0.5)
def get_confidence_reduction_factor(self) -> float:
"""Get the factor to reduce YOLO confidence by on retry."""
return self.config['processing'].get('confidence_reduction_factor', 0.8)

266
core/eye_processor.py Normal file

@@ -0,0 +1,266 @@
"""
Eye processor module for VR180 separate eye processing.
Handles splitting VR180 side-by-side frames into separate left/right eyes and recombining.
"""
import os
import cv2
import numpy as np
import logging
import subprocess
from typing import Dict, List, Any, Optional, Tuple
logger = logging.getLogger(__name__)
class EyeProcessor:
"""Handles VR180 eye-specific processing operations."""
def __init__(self, eye_overlap_pixels: int = 0):
"""
Initialize eye processor.
Args:
eye_overlap_pixels: Number of pixels to overlap between eyes for blending
"""
self.eye_overlap_pixels = eye_overlap_pixels
def split_frame_into_eyes(self, frame: np.ndarray) -> Tuple[np.ndarray, np.ndarray]:
"""
Split a VR180 side-by-side frame into separate left and right eye frames.
Args:
frame: Input VR180 frame (BGR format)
Returns:
Tuple of (left_eye_frame, right_eye_frame)
"""
if len(frame.shape) != 3:
raise ValueError("Frame must be a 3-channel BGR image")
height, width, channels = frame.shape
half_width = width // 2
# Extract left and right eye frames
left_eye = frame[:, :half_width + self.eye_overlap_pixels, :]
right_eye = frame[:, half_width - self.eye_overlap_pixels:, :]
logger.debug(f"Split frame {width}x{height} into left: {left_eye.shape} and right: {right_eye.shape}")
return left_eye, right_eye
def split_video_into_eyes(self, input_video_path: str, left_output_path: str,
right_output_path: str, scale: float = 1.0) -> bool:
"""
Split a VR180 video into separate left and right eye videos using FFmpeg.
Args:
input_video_path: Path to input VR180 video
left_output_path: Output path for left eye video
right_output_path: Output path for right eye video
scale: Scale factor for output videos (default: 1.0)
Returns:
True if successful, False otherwise
"""
try:
# Get video properties
cap = cv2.VideoCapture(input_video_path)
if not cap.isOpened():
logger.error(f"Could not open video: {input_video_path}")
return False
width = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))
height = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))
fps = cap.get(cv2.CAP_PROP_FPS)
cap.release()
# Calculate output dimensions
half_width = int((width // 2) * scale)
output_height = int(height * scale)
# Create output directories if they don't exist
os.makedirs(os.path.dirname(left_output_path), exist_ok=True)
os.makedirs(os.path.dirname(right_output_path), exist_ok=True)
# FFmpeg command for left eye (crop left half)
left_command = [
'ffmpeg', '-y',
'-i', input_video_path,
'-vf', f'crop={width//2 + self.eye_overlap_pixels}:{height}:0:0,scale={half_width}:{output_height}',
'-c:v', 'libx264',
'-preset', 'fast',
'-crf', '18',
left_output_path
]
# FFmpeg command for right eye (crop right half)
right_command = [
'ffmpeg', '-y',
'-i', input_video_path,
'-vf', f'crop={width//2 + self.eye_overlap_pixels}:{height}:{width//2 - self.eye_overlap_pixels}:0,scale={half_width}:{output_height}',
'-c:v', 'libx264',
'-preset', 'fast',
'-crf', '18',
right_output_path
]
logger.info(f"Splitting video into left eye: {left_output_path}")
result_left = subprocess.run(left_command, capture_output=True, text=True)
if result_left.returncode != 0:
logger.error(f"FFmpeg failed for left eye: {result_left.stderr}")
return False
logger.info(f"Splitting video into right eye: {right_output_path}")
result_right = subprocess.run(right_command, capture_output=True, text=True)
if result_right.returncode != 0:
logger.error(f"FFmpeg failed for right eye: {result_right.stderr}")
return False
logger.info("Successfully split video into separate eye videos")
return True
except Exception as e:
logger.error(f"Error splitting video into eyes: {e}")
return False
def combine_eye_masks(self, left_masks: Optional[Dict[int, np.ndarray]],
right_masks: Optional[Dict[int, np.ndarray]],
full_frame_shape: Tuple[int, int]) -> Dict[int, Dict[int, np.ndarray]]:
"""
Combine left and right eye masks back into full-frame format.
Args:
left_masks: Dictionary of masks from left eye processing (frame_idx -> mask)
right_masks: Dictionary of masks from right eye processing (frame_idx -> mask)
full_frame_shape: Shape of the full VR180 frame (height, width)
Returns:
Dictionary mapping frame_idx to {object_id: mask} in full-frame format (every combined mask is stored under object ID 1)
"""
combined_masks = {}
full_height, full_width = full_frame_shape
half_width = full_width // 2
# Get all frame indices from both eyes
left_frames = set(left_masks.keys()) if left_masks else set()
right_frames = set(right_masks.keys()) if right_masks else set()
all_frames = left_frames.union(right_frames)
for frame_idx in all_frames:
# Create full-frame mask
combined_mask = np.zeros((full_height, full_width), dtype=np.uint8)
# Add left eye mask to left half of frame
if left_masks and frame_idx in left_masks:
left_mask = left_masks[frame_idx]
if len(left_mask.shape) == 3:
left_mask = left_mask.squeeze()
# Resize left mask to fit left half of full frame
left_target_width = half_width + self.eye_overlap_pixels
if left_mask.shape != (full_height, left_target_width):
left_mask = cv2.resize(left_mask.astype(np.uint8),
(left_target_width, full_height),
interpolation=cv2.INTER_NEAREST)
# Place in left half of combined mask
combined_mask[:, :left_target_width] = left_mask[:, :left_target_width]
# Add right eye mask to right half of frame
if right_masks and frame_idx in right_masks:
right_mask = right_masks[frame_idx]
if len(right_mask.shape) == 3:
right_mask = right_mask.squeeze()
# Resize right mask to fit right half of full frame
right_target_width = half_width + self.eye_overlap_pixels
right_start_x = half_width - self.eye_overlap_pixels
if right_mask.shape != (full_height, right_target_width):
right_mask = cv2.resize(right_mask.astype(np.uint8),
(right_target_width, full_height),
interpolation=cv2.INTER_NEAREST)
# Place in right half of combined mask
combined_mask[:, right_start_x:] = right_mask
# Store combined mask for this frame (using object ID 1 for simplicity)
combined_masks[frame_idx] = {1: combined_mask}
logger.debug(f"Combined {len(combined_masks)} frame masks from left/right eyes")
return combined_masks
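The merge above can be sketched in isolation as follows. This is a minimal illustration, not the module's code: `combine_eye_masks` is a hypothetical helper, it assumes each eye mask is already sized `(full_height, half_width + overlap)`, and it merges the overlap band with an element-wise maximum so that neither eye erases the other.

```python
import numpy as np


def combine_eye_masks(left_mask, right_mask, full_shape, overlap=0):
    """Merge per-eye masks into one full-frame mask (hypothetical helper)."""
    full_h, full_w = full_shape
    half_w = full_w // 2
    eye_w = half_w + overlap  # each eye crop extends `overlap` pixels past the seam
    combined = np.zeros((full_h, full_w), dtype=np.uint8)
    if left_mask is not None:
        combined[:, :eye_w] = left_mask.astype(np.uint8)
    if right_mask is not None:
        start = half_w - overlap
        # Union with whatever the left eye already wrote in the overlap band.
        combined[:, start:] = np.maximum(combined[:, start:],
                                         right_mask.astype(np.uint8))
    return combined


# 8-pixel-wide frame, 1 pixel of overlap: each eye crop is 5 pixels wide.
left = np.zeros((2, 5), dtype=np.uint8)
left[:, 4] = 1   # left-eye pixel inside the overlap band
right = np.zeros((2, 5), dtype=np.uint8)
right[:, 4] = 1  # right-eye pixel at the far right edge
full = combine_eye_masks(left, right, (2, 8), overlap=1)
print(full[0])
```

With `overlap=0` the two halves do not intersect and the maximum reduces to a plain assignment.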
def is_in_left_half(self, detection: Dict[str, Any], frame_width: int) -> bool:
"""
Check if a detection is in the left half of a VR180 frame.
Args:
detection: YOLO detection dictionary with 'bbox' key
frame_width: Width of the full VR180 frame
Returns:
True if detection center is in left half
"""
bbox = detection['bbox']
center_x = (bbox[0] + bbox[2]) / 2
return center_x < (frame_width // 2)
def is_in_right_half(self, detection: Dict[str, Any], frame_width: int) -> bool:
"""
Check if a detection is in the right half of a VR180 frame.
Args:
detection: YOLO detection dictionary with 'bbox' key
frame_width: Width of the full VR180 frame
Returns:
True if detection center is in right half
"""
return not self.is_in_left_half(detection, frame_width)
def convert_detection_to_eye_coordinates(self, detection: Dict[str, Any],
eye_side: str, frame_width: int) -> Dict[str, Any]:
"""
Convert a full-frame detection to eye-specific coordinates.
Args:
detection: YOLO detection dictionary with 'bbox' key
eye_side: 'left' or 'right'
frame_width: Width of the full VR180 frame
Returns:
Detection with converted coordinates for the specific eye
"""
bbox = detection['bbox'].copy()
half_width = frame_width // 2
if eye_side == 'right':
# Shift right eye coordinates to start from 0
bbox[0] -= (half_width - self.eye_overlap_pixels) # x1
bbox[2] -= (half_width - self.eye_overlap_pixels) # x2
# Ensure coordinates are within bounds
eye_width = half_width + self.eye_overlap_pixels
bbox[0] = max(0, min(bbox[0], eye_width - 1))
bbox[2] = max(0, min(bbox[2], eye_width - 1))
converted_detection = detection.copy()
converted_detection['bbox'] = bbox
return converted_detection
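The eye assignment and coordinate shift can be checked with plain numbers. The sketch below assumes boxes are `[x1, y1, x2, y2]` in full-frame pixels, as the index comments above suggest; `eye_for_box` and `to_eye_coordinates` are hypothetical stand-ins for the methods, not part of the module.

```python
def eye_for_box(bbox, frame_width):
    """Return 'left' or 'right' based on the horizontal center of the box."""
    center_x = (bbox[0] + bbox[2]) / 2
    return "left" if center_x < frame_width // 2 else "right"


def to_eye_coordinates(bbox, eye_side, frame_width, overlap=0):
    """Map a full-frame [x1, y1, x2, y2] box into one eye's crop."""
    x1, y1, x2, y2 = bbox
    half_w = frame_width // 2
    eye_w = half_w + overlap
    if eye_side == "right":
        # The right-eye crop starts `overlap` pixels before the seam.
        offset = half_w - overlap
        x1 -= offset
        x2 -= offset

    def clamp(x):
        return max(0, min(x, eye_w - 1))

    return [clamp(x1), y1, clamp(x2), y2]


box = [2500, 100, 3000, 900]          # detection in a 4000-pixel-wide SBS frame
side = eye_for_box(box, 4000)
local = to_eye_coordinates(box, side, 4000, overlap=50)
print(side, local)
```

A box that straddles the seam is assigned to one eye by its center only, and the clamp then truncates whatever part falls outside that eye's crop.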
def create_full_greenscreen_frame(self, frame_shape: Tuple[int, int, int],
green_color: Tuple[int, int, int] = (0, 255, 0)) -> np.ndarray:
"""
Create a full greenscreen frame for fallback when no humans are detected.
Args:
frame_shape: Shape of the frame (height, width, channels)
green_color: RGB values for green screen color
Returns:
Full greenscreen frame
"""
greenscreen_frame = np.full(frame_shape, green_color, dtype=np.uint8)
logger.debug(f"Created full greenscreen frame with shape {frame_shape}")
return greenscreen_frame

914
core/mask_processor.py Normal file

@@ -0,0 +1,914 @@
"""
Mask processor module for applying green screen effects.
Handles applying masks to video frames to create green screen output.
"""
import os
import cv2
import numpy as np
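try:
import cupy as cp # Optional dependency, only needed for the GPU path
except ImportError:
cp = None # _check_gpu_availability() then selects the CPU path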
import subprocess
import sys
import logging
from typing import Dict, List, Any, Optional, Tuple
from collections import deque
logger = logging.getLogger(__name__)
class MaskProcessor:
"""Handles mask application and green screen processing with quality enhancements."""
def __init__(self, green_color: List[int] = [0, 255, 0], blue_color: List[int] = [255, 0, 0],
mask_quality_config: Optional[Dict[str, Any]] = None,
output_mode: str = "green_screen"):
"""
Initialize mask processor with quality enhancement options.
Args:
green_color: BGR color for the green screen background (frames are OpenCV BGR)
blue_color: BGR color for the second object (the default [255, 0, 0] is blue in BGR)
mask_quality_config: Configuration dictionary for mask quality improvements
output_mode: Output mode - "green_screen" or "alpha_channel"
"""
self.green_color = green_color
self.blue_color = blue_color
self.output_mode = output_mode
self.use_gpu = self._check_gpu_availability()
# Mask quality configuration with defaults
if mask_quality_config is None:
mask_quality_config = {}
self.enable_edge_blur = mask_quality_config.get('enable_edge_blur', False)
self.edge_blur_radius = mask_quality_config.get('edge_blur_radius', 3)
self.edge_blur_sigma = mask_quality_config.get('edge_blur_sigma', 1.5)
self.enable_temporal_smoothing = mask_quality_config.get('enable_temporal_smoothing', False)
self.temporal_blend_weight = mask_quality_config.get('temporal_blend_weight', 0.3)
self.temporal_history_frames = mask_quality_config.get('temporal_history_frames', 3)
self.enable_morphological_cleaning = mask_quality_config.get('enable_morphological_cleaning', False)
self.morphology_kernel_size = mask_quality_config.get('morphology_kernel_size', 5)
self.min_component_size = mask_quality_config.get('min_component_size', 500)
self.alpha_blending_mode = mask_quality_config.get('alpha_blending_mode', 'gaussian')
self.alpha_transition_width = mask_quality_config.get('alpha_transition_width', 10)
self.enable_bilateral_filter = mask_quality_config.get('enable_bilateral_filter', False)
self.bilateral_d = mask_quality_config.get('bilateral_d', 9)
self.bilateral_sigma_color = mask_quality_config.get('bilateral_sigma_color', 75)
self.bilateral_sigma_space = mask_quality_config.get('bilateral_sigma_space', 75)
# Temporal history buffer for mask smoothing
self.mask_history = deque(maxlen=self.temporal_history_frames)
# Log configuration
if any([self.enable_edge_blur, self.enable_temporal_smoothing,
self.enable_morphological_cleaning, self.enable_bilateral_filter]):
logger.info("Mask quality enhancements enabled:")
if self.enable_edge_blur:
logger.info(f" Edge blur: radius={self.edge_blur_radius}, sigma={self.edge_blur_sigma}")
if self.enable_temporal_smoothing:
logger.info(f" Temporal smoothing: weight={self.temporal_blend_weight}, history={self.temporal_history_frames}")
if self.enable_morphological_cleaning:
logger.info(f" Morphological cleaning: kernel={self.morphology_kernel_size}, min_size={self.min_component_size}")
logger.info(f" Alpha blending: mode={self.alpha_blending_mode}, width={self.alpha_transition_width}")
else:
logger.info("Mask quality enhancements disabled - using standard binary masking")
logger.info(f"Output mode: {self.output_mode}")
def _check_gpu_availability(self) -> bool:
"""Check if CuPy GPU acceleration is available."""
try:
import cupy as cp
# Test GPU availability
test_array = cp.array([1, 2, 3])
_ = test_array * 2
logger.info("GPU acceleration available via CuPy")
return True
except Exception as e:
logger.warning(f"GPU acceleration not available, using CPU: {e}")
return False
def enhance_mask_quality(self, mask: np.ndarray) -> np.ndarray:
"""
Apply all enabled mask quality enhancements.
Args:
mask: Input binary mask
Returns:
Enhanced mask with quality improvements applied
"""
enhanced_mask = mask.copy()
# 1. Morphological cleaning
if self.enable_morphological_cleaning:
enhanced_mask = self._clean_mask_morphologically(enhanced_mask)
# 2. Temporal smoothing
if self.enable_temporal_smoothing:
enhanced_mask = self._apply_temporal_smoothing(enhanced_mask)
# 3. Edge enhancement and blurring
if self.enable_edge_blur:
enhanced_mask = self._apply_edge_blur(enhanced_mask)
# 4. Bilateral filtering (if enabled)
if self.enable_bilateral_filter:
enhanced_mask = self._apply_bilateral_filter(enhanced_mask)
return enhanced_mask
def _clean_mask_morphologically(self, mask: np.ndarray) -> np.ndarray:
"""
Clean mask using morphological operations to remove noise and small artifacts.
Args:
mask: Input binary mask
Returns:
Cleaned mask
"""
# Convert to uint8 for OpenCV operations
mask_uint8 = (mask * 255).astype(np.uint8)
# Create morphological kernel
kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE,
(self.morphology_kernel_size, self.morphology_kernel_size))
# Opening operation (erosion followed by dilation) to remove small noise
cleaned = cv2.morphologyEx(mask_uint8, cv2.MORPH_OPEN, kernel)
# Closing operation (dilation followed by erosion) to fill small holes
cleaned = cv2.morphologyEx(cleaned, cv2.MORPH_CLOSE, kernel)
# Remove small connected components
if self.min_component_size > 0:
cleaned = self._remove_small_components(cleaned)
return (cleaned / 255.0).astype(np.float32)
def _remove_small_components(self, mask: np.ndarray) -> np.ndarray:
"""
Remove connected components smaller than minimum size.
Args:
mask: Input binary mask (uint8)
Returns:
Mask with small components removed
"""
# Find connected components
num_labels, labels, stats, centroids = cv2.connectedComponentsWithStats(mask, connectivity=8)
# Create output mask
output_mask = np.zeros_like(mask)
# Keep components larger than minimum size (skip background label 0)
for i in range(1, num_labels):
component_size = stats[i, cv2.CC_STAT_AREA]
if component_size >= self.min_component_size:
output_mask[labels == i] = 255
return output_mask
def _apply_temporal_smoothing(self, mask: np.ndarray) -> np.ndarray:
"""
Apply temporal smoothing using mask history.
Args:
mask: Current frame mask
Returns:
Temporally smoothed mask
"""
if len(self.mask_history) == 0:
# First frame, no history to blend with
self.mask_history.append(mask.copy())
return mask
# Blend with previous frames using weighted average
smoothed_mask = mask.astype(np.float32)
total_weight = 1.0
for i, hist_mask in enumerate(reversed(self.mask_history)):
# Exponential decay: more recent frames have higher weight
frame_weight = self.temporal_blend_weight * (0.8 ** i)
smoothed_mask += hist_mask.astype(np.float32) * frame_weight
total_weight += frame_weight
# Normalize by total weight
smoothed_mask /= total_weight
# Update history
self.mask_history.append(mask.copy())
return smoothed_mask
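The weighting in the loop above is easier to see on a single pixel. This is a minimal sketch under the same scheme (current frame weight 1.0, history weights `blend_weight * 0.8 ** age`); `smooth` is a hypothetical free function rather than the method itself.

```python
from collections import deque

import numpy as np


def smooth(mask, history, blend_weight=0.3, decay=0.8):
    """Blend the current mask with past masks; newer history weighs more."""
    smoothed = mask.astype(np.float32)
    total = 1.0  # weight of the current frame
    for age, past in enumerate(reversed(history)):
        weight = blend_weight * decay ** age
        smoothed = smoothed + past.astype(np.float32) * weight
        total += weight
    history.append(mask.copy())  # raw mask goes into history, not the blend
    return smoothed / total


history = deque(maxlen=3)
on = np.ones((1, 1), dtype=np.float32)
off = np.zeros((1, 1), dtype=np.float32)
first = smooth(on, history)    # no history yet, returned unchanged
second = smooth(off, history)  # pulled toward the previous 'on' frame
print(first[0, 0], second[0, 0])
```

Because the output is no longer binary, a pixel that switches off decays over a few frames instead of vanishing at once, which is why the history has to be cleared between unrelated segments.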
def _apply_edge_blur(self, mask: np.ndarray) -> np.ndarray:
"""
Apply Gaussian blur to mask edges for smooth transitions.
Args:
mask: Input mask
Returns:
Mask with blurred edges
"""
# Apply Gaussian blur
kernel_size = 2 * self.edge_blur_radius + 1
blurred_mask = cv2.GaussianBlur(mask.astype(np.float32),
(kernel_size, kernel_size),
self.edge_blur_sigma)
return blurred_mask
def _apply_bilateral_filter(self, mask: np.ndarray) -> np.ndarray:
"""
Apply bilateral filtering for edge-preserving smoothing.
Args:
mask: Input mask
Returns:
Filtered mask
"""
# Convert to uint8 for bilateral filter
mask_uint8 = (mask * 255).astype(np.uint8)
# Apply bilateral filter
filtered = cv2.bilateralFilter(mask_uint8, self.bilateral_d,
self.bilateral_sigma_color,
self.bilateral_sigma_space)
return (filtered / 255.0).astype(np.float32)
def _create_alpha_mask(self, mask: np.ndarray) -> np.ndarray:
"""
Create alpha mask with smooth transitions based on blending mode.
Args:
mask: Input binary/float mask
Returns:
Alpha mask with smooth transitions
"""
if self.alpha_blending_mode == "linear":
return mask
elif self.alpha_blending_mode == "gaussian":
# Use distance transform for smooth falloff
binary_mask = (mask > 0.5).astype(np.uint8)
# Distance transform from mask edges
dist_inside = cv2.distanceTransform(binary_mask, cv2.DIST_L2, 5)
dist_outside = cv2.distanceTransform(1 - binary_mask, cv2.DIST_L2, 5)
# Create smooth alpha based on distance
alpha = np.zeros_like(mask, dtype=np.float32)
transition_width = self.alpha_transition_width
# Inside mask: fade from edge
alpha[binary_mask > 0] = np.minimum(1.0, dist_inside[binary_mask > 0] / transition_width)
# Outside mask: fade to zero
alpha[binary_mask == 0] = np.maximum(0.0, 1.0 - dist_outside[binary_mask == 0] / transition_width)
return alpha
elif self.alpha_blending_mode == "sigmoid":
# Sigmoid-based smooth transition
return 1.0 / (1.0 + np.exp(-10 * (mask - 0.5)))
else:
return mask
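One property of the "gaussian" branch worth checking: the inside ramp starts near 0 at the edge while the outside ramp starts near 1, so alpha appears to jump downward when crossing into the mask. The 1-D sketch below reproduces both ramps with a brute-force distance and compares them with a signed-distance ramp. All three functions are hypothetical illustrations, and the signed-distance variant is a suggested alternative, not what the module does.

```python
import numpy as np


def edge_distance(binary):
    """Distance from each pixel to the nearest pixel of the opposite value (1-D)."""
    idx = np.arange(binary.size)
    dist = np.zeros(binary.size, dtype=np.float32)
    for i in idx:
        other = idx[binary != binary[i]]
        dist[i] = np.abs(other - i).min() if other.size else np.inf
    return dist


def two_ramp_alpha(binary, width):
    """Same two ramps as the 'gaussian' branch above."""
    d = edge_distance(binary)
    inside = np.minimum(1.0, d / width)
    outside = np.maximum(0.0, 1.0 - d / width)
    return np.where(binary > 0, inside, outside)


def signed_distance_alpha(binary, width):
    """Single ramp centered on the edge: 0.5 at the boundary, 0 and 1 at +/- width."""
    d = edge_distance(binary)
    signed = np.where(binary > 0, d, -d)
    return np.clip(0.5 + signed / (2.0 * width), 0.0, 1.0)


row = np.array([0, 0, 0, 0, 1, 1, 1, 1], dtype=np.uint8)
two_ramp = two_ramp_alpha(row, 4)
signed = signed_distance_alpha(row, 4)
print(two_ramp)
print(signed)
```

On this row the two-ramp alpha falls from 0.75 to 0.25 across the edge, whereas the signed-distance alpha rises monotonically from background to foreground.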
def apply_green_mask(self, frame: np.ndarray, masks: List[np.ndarray]) -> np.ndarray:
"""
Apply green screen mask to a frame with quality enhancements.
Args:
frame: Input video frame (BGR format)
masks: List of object masks to apply
Returns:
Frame with green screen background and enhanced mask quality
"""
# Combine all masks into a single mask
combined_mask = self._combine_masks(masks)
# Apply quality enhancements
enhanced_mask = self.enhance_mask_quality(combined_mask)
# Create alpha mask for smooth blending
alpha_mask = self._create_alpha_mask(enhanced_mask)
# Apply mask using alpha blending
if self.use_gpu:
return self._apply_green_mask_gpu_enhanced(frame, alpha_mask)
else:
return self._apply_green_mask_cpu_enhanced(frame, alpha_mask)
def apply_mask_with_alpha(self, frame: np.ndarray, masks: List[np.ndarray]) -> np.ndarray:
"""
Apply mask to create RGBA frame with alpha channel.
Args:
frame: Input video frame (BGR format)
masks: List of object masks to apply
Returns:
RGBA frame with alpha channel
"""
# Combine all masks into a single mask
combined_mask = self._combine_masks(masks)
# Apply quality enhancements
enhanced_mask = self.enhance_mask_quality(combined_mask)
# Create alpha mask for smooth blending
alpha_mask = self._create_alpha_mask(enhanced_mask)
# Resize alpha mask to match frame if needed
if alpha_mask.shape != frame.shape[:2]:
alpha_mask = cv2.resize(alpha_mask, (frame.shape[1], frame.shape[0]))
# Convert BGR to BGRA
bgra_frame = cv2.cvtColor(frame, cv2.COLOR_BGR2BGRA)
# Set alpha channel
bgra_frame[:, :, 3] = (alpha_mask * 255).astype(np.uint8)
return bgra_frame
def _combine_masks(self, masks: List[np.ndarray]) -> np.ndarray:
"""
Combine multiple object masks into a single mask.
Args:
masks: List of object masks
Returns:
Combined mask
"""
if not masks:
return np.zeros((0, 0), dtype=np.float32)
# Start with first mask
combined_mask = masks[0].squeeze().astype(np.float32)
# Combine with remaining masks using element-wise maximum (a logical OR for binary masks)
for mask in masks[1:]:
mask_squeezed = mask.squeeze().astype(np.float32)
if mask_squeezed.shape != combined_mask.shape:
# Resize mask to match combined mask
mask_squeezed = cv2.resize(mask_squeezed,
(combined_mask.shape[1], combined_mask.shape[0]),
interpolation=cv2.INTER_NEAREST)
combined_mask = np.maximum(combined_mask, mask_squeezed)
return combined_mask
def reset_temporal_history(self):
"""Reset temporal history buffer. Call this when starting a new segment."""
self.mask_history.clear()
logger.debug("Temporal history buffer reset")
def _apply_green_mask_gpu_enhanced(self, frame: np.ndarray, alpha_mask: np.ndarray) -> np.ndarray:
"""GPU-accelerated green mask application with alpha blending using CuPy (Phase 1 optimized)."""
try:
# Convert to CuPy arrays with optimized data transfer
frame_gpu = cp.asarray(frame, dtype=cp.uint8)
alpha_gpu = cp.asarray(alpha_mask, dtype=cp.float32)
# Resize alpha mask to match frame if needed (vectorized operation)
if alpha_gpu.shape != frame_gpu.shape[:2]:
# Resize on the CPU with OpenCV, then copy the result back to the GPU
alpha_gpu = cp.array(cv2.resize(cp.asnumpy(alpha_gpu),
(frame_gpu.shape[1], frame_gpu.shape[0])))
# Create green background (optimized broadcasting)
green_color_gpu = cp.array(self.green_color, dtype=cp.uint8)
green_background = cp.broadcast_to(green_color_gpu, frame_gpu.shape)
# Apply vectorized alpha blending with optimized memory access
alpha_3d = cp.expand_dims(alpha_gpu, axis=2)
# Use more efficient computation with explicit typing
frame_float = frame_gpu.astype(cp.float32)
green_float = green_background.astype(cp.float32)
# Vectorized blending operation
result_frame = cp.clip(alpha_3d * frame_float + (1.0 - alpha_3d) * green_float, 0, 255)
return cp.asnumpy(result_frame.astype(cp.uint8))
except Exception as e:
logger.error(f"GPU enhanced processing failed, falling back to CPU: {e}")
return self._apply_green_mask_cpu_enhanced(frame, alpha_mask)
def _apply_green_mask_cpu_enhanced(self, frame: np.ndarray, alpha_mask: np.ndarray) -> np.ndarray:
"""CPU-based green mask application with alpha blending (Phase 1 optimized)."""
# Resize alpha mask to match frame if needed
if alpha_mask.shape != frame.shape[:2]:
alpha_mask = cv2.resize(alpha_mask, (frame.shape[1], frame.shape[0]))
# Create green background with broadcasting (more efficient)
green_color = np.array(self.green_color, dtype=np.uint8)
green_background = np.broadcast_to(green_color, frame.shape)
# Apply optimized alpha blending with explicit data types
alpha_3d = np.expand_dims(alpha_mask.astype(np.float32), axis=2)
# Vectorized blending with optimized memory access
frame_float = frame.astype(np.float32)
green_float = green_background.astype(np.float32)
result_frame = np.clip(alpha_3d * frame_float + (1.0 - alpha_3d) * green_float, 0, 255)
return result_frame.astype(np.uint8)
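The blend used by both the GPU and CPU paths is `alpha * frame + (1 - alpha) * green`, broadcast over the three color channels. A minimal NumPy sketch, where `blend_over_green` is a hypothetical helper and the frame is assumed to be BGR `uint8` with alpha in `[0, 1]`:

```python
import numpy as np


def blend_over_green(frame, alpha, green=(0, 255, 0)):
    """Composite `frame` over a solid background using a per-pixel alpha."""
    alpha_3d = alpha.astype(np.float32)[..., None]  # (H, W) -> (H, W, 1)
    background = np.broadcast_to(np.array(green, dtype=np.float32), frame.shape)
    out = alpha_3d * frame.astype(np.float32) + (1.0 - alpha_3d) * background
    return np.clip(out, 0, 255).astype(np.uint8)


frame = np.full((1, 3, 3), 200, dtype=np.uint8)         # three gray pixels
alpha = np.array([[1.0, 0.5, 0.0]], dtype=np.float32)   # keep, half, replace
result = blend_over_green(frame, alpha)
print(result[0])
```

Alpha 1 keeps the source pixel, alpha 0 yields pure green, and intermediate values produce the soft edge that the quality options above are meant to control.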
def apply_colored_mask(self, frame: np.ndarray, masks_a: List[np.ndarray],
masks_b: List[np.ndarray]) -> np.ndarray:
"""
Apply colored masks for visualization (green and blue).
Args:
frame: Input video frame
masks_a: Masks for object A (green)
masks_b: Masks for object B (blue)
Returns:
Frame with colored masks applied
"""
colored_mask = np.zeros_like(frame)
# Apply green color to masks_a
for mask in masks_a:
mask = mask.squeeze()
if mask.shape != frame.shape[:2]:
mask = cv2.resize(mask, (frame.shape[1], frame.shape[0]),
interpolation=cv2.INTER_NEAREST)
colored_mask[mask > 0] = self.green_color
# Apply blue color to masks_b
for mask in masks_b:
mask = mask.squeeze()
if mask.shape != frame.shape[:2]:
mask = cv2.resize(mask, (frame.shape[1], frame.shape[0]),
interpolation=cv2.INTER_NEAREST)
colored_mask[mask > 0] = self.blue_color
return colored_mask
def process_and_save_output_video(self, video_path: str, output_video_path: str,
video_segments: Dict[int, Dict[int, np.ndarray]],
use_nvenc: bool = False, bitrate: str = "50M",
batch_size: int = 16) -> bool:
"""
Process high-resolution frames, apply upscaled masks, and save the output video.
Args:
video_path: Path to input video
output_video_path: Path to save output video
video_segments: Dictionary of frame masks
use_nvenc: Whether to use NVIDIA hardware encoding
bitrate: Output video bitrate
batch_size: Number of frames to process in a single batch
Returns:
True if successful
"""
try:
cap = cv2.VideoCapture(video_path)
if not cap.isOpened():
logger.error(f"Could not open video: {video_path}")
return False
frame_width = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))
frame_height = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))
fps = cap.get(cv2.CAP_PROP_FPS) or 30.0
total_frames = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
logger.info(f"Processing video: {frame_width}x{frame_height} @ {fps}fps, {total_frames} frames")
# Setup VideoWriter
out_writer = None
if self.output_mode == "alpha_channel":
success = self._setup_alpha_encoder(output_video_path, frame_width, frame_height, fps, bitrate)
if not success:
logger.error("Failed to setup alpha channel encoder")
cap.release()
return False
use_nvenc = False
elif use_nvenc:
success = self._setup_nvenc_encoder(output_video_path, frame_width, frame_height, fps, bitrate)
if not success:
logger.warning("NVENC setup failed, falling back to OpenCV")
use_nvenc = False
if not use_nvenc and self.output_mode != "alpha_channel":
fourcc = cv2.VideoWriter_fourcc(*'mp4v')
out_writer = cv2.VideoWriter(output_video_path, fourcc, fps, (frame_width, frame_height))
if not out_writer.isOpened():
logger.error("Failed to create output video writer")
cap.release()
return False
# Process frames in batches
frame_idx = 0
processed_frames = 0
while frame_idx < total_frames:
batch_frames = []
batch_masks = []
# Read a batch of frames
for _ in range(batch_size):
ret, frame = cap.read()
if not ret:
break
batch_frames.append(frame)
if not batch_frames:
break
# Get masks for the current batch and perform just-in-time upscaling
for i in range(len(batch_frames)):
current_frame_idx = frame_idx + i
if current_frame_idx in video_segments:
frame_masks = video_segments[current_frame_idx]
upscaled_masks = []
for obj_id, mask in frame_masks.items():
mask = mask.squeeze()
if mask.shape != (frame_height, frame_width):
upscaled_mask = cv2.resize(mask.astype(np.uint8),
(frame_width, frame_height),
interpolation=cv2.INTER_NEAREST)
upscaled_masks.append(upscaled_mask)
else:
upscaled_masks.append(mask.astype(np.uint8))
batch_masks.append(upscaled_masks)
else:
batch_masks.append([]) # No masks for this frame
# Process the batch
result_batch = []
for i, frame in enumerate(batch_frames):
masks = batch_masks[i]
if masks:
if self.output_mode == "alpha_channel":
result_frame = self.apply_mask_with_alpha(frame, masks)
else:
result_frame = self.apply_green_mask(frame, masks)
else:
# No mask for this frame
if self.output_mode == "alpha_channel":
bgra_frame = cv2.cvtColor(frame, cv2.COLOR_BGR2BGRA)
bgra_frame[:, :, 3] = 0
result_frame = bgra_frame
else:
result_frame = frame
result_batch.append(result_frame)
# Write the processed batch
for result_frame in result_batch:
if self.output_mode == "alpha_channel" and hasattr(self, 'alpha_process'):
self.alpha_process.stdin.write(result_frame.tobytes())
elif use_nvenc and hasattr(self, 'nvenc_process'):
self.nvenc_process.stdin.write(result_frame.tobytes())
else:
out_writer.write(result_frame)
processed_frames += len(batch_frames)
frame_idx += len(batch_frames)
if processed_frames % 100 < batch_size:
logger.info(f"Processed {processed_frames}/{total_frames} frames")
# Cleanup
cap.release()
if self.output_mode == "alpha_channel" and hasattr(self, 'alpha_process'):
self.alpha_process.stdin.close()
self.alpha_process.wait()
elif use_nvenc and hasattr(self, 'nvenc_process'):
self.nvenc_process.stdin.close()
self.nvenc_process.wait()
else:
if out_writer:
out_writer.release()
logger.info(f"Successfully processed {processed_frames} frames to {output_video_path}")
return True
except Exception as e:
logger.error(f"Error processing video: {e}", exc_info=True)
return False
def _setup_nvenc_encoder(self, output_path: str, width: int, height: int,
fps: float, bitrate: str) -> bool:
"""Set up a hardware HEVC encoder via FFmpeg (NVENC, or VideoToolbox on macOS)."""
try:
# Determine encoder based on platform
if sys.platform == 'darwin':
encoder = 'hevc_videotoolbox'
else:
encoder = 'hevc_nvenc'
command = [
'ffmpeg',
'-y', # Overwrite output file
'-f', 'rawvideo',
'-vcodec', 'rawvideo',
'-pix_fmt', 'bgr24',
'-s', f'{width}x{height}',
'-r', str(fps),
'-i', '-', # Input from stdin
'-an', # No audio (will be added later)
'-vcodec', encoder,
'-pix_fmt', 'yuv420p', # Changed from nv12 for better compatibility
'-preset', 'slow',
'-b:v', bitrate,
output_path
]
# stderr is discarded rather than piped: an unread PIPE can fill up and stall FFmpeg
self.nvenc_process = subprocess.Popen(command, stdin=subprocess.PIPE,
stderr=subprocess.DEVNULL)
logger.info(f"Initialized {encoder} hardware encoder")
return True
except Exception as e:
logger.error(f"Failed to setup NVENC encoder: {e}")
return False
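The encoder setup amounts to building an argument list that tells FFmpeg how to interpret raw bytes on stdin. The sketch below isolates that step so it can be inspected without launching a process. `build_rawvideo_encode_command` and `expected_frame_bytes` are hypothetical helpers, and the flags are a subset of those used above.

```python
import sys


def build_rawvideo_encode_command(output_path, width, height, fps, bitrate,
                                  platform=sys.platform):
    """Return an FFmpeg argv that encodes raw BGR frames read from stdin."""
    encoder = "hevc_videotoolbox" if platform == "darwin" else "hevc_nvenc"
    return [
        "ffmpeg", "-y",
        # Input side: headerless frames, so size, rate and layout are explicit.
        "-f", "rawvideo", "-pix_fmt", "bgr24",
        "-s", f"{width}x{height}", "-r", str(fps),
        "-i", "-",
        "-an",
        # Output side: hardware HEVC at the requested bitrate.
        "-c:v", encoder, "-pix_fmt", "yuv420p",
        "-b:v", bitrate,
        output_path,
    ]


def expected_frame_bytes(width, height, channels=3):
    """Bytes FFmpeg consumes per frame for an 8-bit packed format."""
    return width * height * channels


cmd = build_rawvideo_encode_command("out.mp4", 3840, 1920, 30.0, "50M",
                                    platform="linux")
print(" ".join(cmd))
print(expected_frame_bytes(3840, 1920))
```

Every `frame.tobytes()` written to the pipe has to match `expected_frame_bytes` exactly, otherwise FFmpeg misaligns all following frames. Note also that a successful `Popen` only shows that the `ffmpeg` binary started, not that the chosen encoder is usable.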
def _setup_alpha_encoder(self, output_path: str, width: int, height: int,
fps: float, bitrate: str) -> bool:
"""Set up an FFmpeg HEVC encoder that receives BGRA frames for alpha channel output."""
try:
# The encoder reads BGRA frames from stdin. Note that the yuv420p output format
# requested below has no alpha plane, so the alpha channel is not stored in the
# encoded stream; keeping transparency needs an alpha-capable codec and pixel format.
# Determine encoder based on platform
if sys.platform == 'darwin':
encoder = 'hevc_videotoolbox'
else:
encoder = 'hevc_nvenc'
command = [
'ffmpeg',
'-y', # Overwrite output file
'-f', 'rawvideo',
'-vcodec', 'rawvideo',
'-pix_fmt', 'bgra', # BGRA for alpha channel
'-s', f'{width}x{height}',
'-r', str(fps),
'-i', '-', # Input from stdin
'-an', # No audio (will be added later)
'-c:v', encoder,
'-pix_fmt', 'yuv420p', # No alpha plane: the alpha channel of the BGRA input is dropped
'-preset', 'slow',
'-b:v', bitrate,
'-tag:v', 'hvc1', # Required for some players
output_path
]
# stderr is discarded rather than piped: an unread PIPE can fill up and stall FFmpeg
self.alpha_process = subprocess.Popen(command, stdin=subprocess.PIPE,
stderr=subprocess.DEVNULL)
self.alpha_output_path = output_path
logger.info(f"Initialized {encoder} for alpha channel output (BGRA input, yuv420p output)")
return True
except Exception as e:
logger.error(f"Failed to setup alpha encoder: {e}")
return False
def process_segment(self, segment_info: dict, video_segments: Dict[int, Dict[int, np.ndarray]],
use_nvenc: bool = False, bitrate: str = "50M") -> bool:
"""
Process a single segment and save the output video.
Args:
segment_info: Segment information dictionary
video_segments: Dictionary of frame masks from SAM2
use_nvenc: Whether to use hardware encoding
bitrate: Output video bitrate
Returns:
True if successful
"""
input_video = segment_info['video_file']
if self.output_mode == "alpha_channel":
output_video = os.path.join(segment_info['directory'], f"output_{segment_info['index']}.mov")
else:
output_video = os.path.join(segment_info['directory'], f"output_{segment_info['index']}.mp4")
logger.info(f"Processing segment {segment_info['index']} with {self.output_mode}")
success = self.process_and_save_output_video(
input_video,
output_video,
video_segments,
use_nvenc,
bitrate
)
if success:
logger.info(f"Successfully created {self.output_mode} video: {output_video}")
# Mark segment as completed only after video is successfully written
try:
output_done_file = os.path.join(segment_info['directory'], "output_frames_done")
with open(output_done_file, 'w') as f:
f.write(f"Segment {segment_info['index']} processed and saved successfully.")
logger.debug(f"Created completion marker for segment {segment_info['index']}")
except Exception as e:
logger.error(f"Failed to create completion marker for segment {segment_info['index']}: {e}")
else:
logger.error(f"Failed to process segment {segment_info['index']}")
return success
def create_full_greenscreen_frame(self, frame_shape: Tuple[int, int, int],
green_color: Optional[List[int]] = None) -> np.ndarray:
"""
Create a full greenscreen frame for fallback when no humans are detected.
Args:
frame_shape: Shape of the frame (height, width, channels)
green_color: RGB values for green screen color (uses default if None)
Returns:
Full greenscreen frame
"""
if green_color is None:
green_color = self.green_color
greenscreen_frame = np.full(frame_shape, green_color, dtype=np.uint8)
logger.debug(f"Created full greenscreen frame with shape {frame_shape}")
return greenscreen_frame
def process_greenscreen_only_segment(self, segment_info: dict,
green_color: Optional[List[int]] = None,
use_nvenc: bool = False, bitrate: str = "50M") -> bool:
"""
Create a full greenscreen segment when no humans are detected.
Used as fallback in separate eye processing mode.
Args:
segment_info: Segment information dictionary
green_color: RGB values for green screen color (uses default if None)
use_nvenc: Whether to use hardware encoding
bitrate: Output video bitrate
Returns:
True if greenscreen segment was created successfully
"""
segment_dir = segment_info['directory']
video_path = segment_info['video_file']
segment_idx = segment_info['index']
logger.info(f"Creating full greenscreen segment {segment_idx} (no humans detected)")
try:
# Get video properties
cap = cv2.VideoCapture(video_path)
if not cap.isOpened():
logger.error(f"Could not open video: {video_path}")
return False
width = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))
height = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))
fps = cap.get(cv2.CAP_PROP_FPS) or 30.0
frame_count = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
cap.release()
# Create output video path
if self.output_mode == "alpha_channel":
output_video_path = os.path.join(segment_dir, f"output_{segment_idx}.mov")
else:
output_video_path = os.path.join(segment_dir, f"output_{segment_idx}.mp4")
# Create greenscreen frame
if green_color is None:
green_color = self.green_color
greenscreen_frame = self.create_full_greenscreen_frame(
(height, width, 3), green_color
)
# Setup video writer based on mode and hardware encoding preference
if use_nvenc:
success = self._write_greenscreen_with_nvenc(
output_video_path, greenscreen_frame, frame_count, fps, bitrate
)
else:
success = self._write_greenscreen_with_opencv(
output_video_path, greenscreen_frame, frame_count, fps
)
if not success:
logger.error(f"Failed to write greenscreen video for segment {segment_idx}")
return False
# Create empty mask file (black mask since no humans detected)
mask_output_path = os.path.join(segment_dir, "mask.png")
black_mask = np.zeros((height, width, 3), dtype=np.uint8)
cv2.imwrite(mask_output_path, black_mask)
# Mark segment as completed
output_done_file = os.path.join(segment_dir, "output_frames_done")
with open(output_done_file, 'w') as f:
f.write(f"Greenscreen segment {segment_idx} completed successfully\n")
logger.info(f"Successfully created greenscreen segment {segment_idx}")
return True
except Exception as e:
logger.error(f"Error creating greenscreen segment {segment_idx}: {e}")
return False
def _write_greenscreen_with_opencv(self, output_path: str, greenscreen_frame: np.ndarray,
frame_count: int, fps: float) -> bool:
"""Write greenscreen video using OpenCV VideoWriter."""
try:
if self.output_mode == "alpha_channel":
# For alpha channel mode, create fully transparent frames
bgra_frame = cv2.cvtColor(greenscreen_frame, cv2.COLOR_BGR2BGRA)
bgra_frame[:, :, 3] = 0 # Fully transparent
fourcc = cv2.VideoWriter_fourcc(*'mp4v')
out = cv2.VideoWriter(output_path, fourcc, fps,
(greenscreen_frame.shape[1], greenscreen_frame.shape[0]), True)
frame_to_write = bgra_frame[:, :, :3] # OpenCV expects BGR for mp4v
else:
fourcc = cv2.VideoWriter_fourcc(*'mp4v')
out = cv2.VideoWriter(output_path, fourcc, fps,
(greenscreen_frame.shape[1], greenscreen_frame.shape[0]))
frame_to_write = greenscreen_frame
if not out.isOpened():
logger.error(f"Failed to open video writer for {output_path}")
return False
# Write identical greenscreen frames
for _ in range(frame_count):
out.write(frame_to_write)
out.release()
logger.debug(f"Wrote {frame_count} greenscreen frames using OpenCV")
return True
except Exception as e:
logger.error(f"Error writing greenscreen with OpenCV: {e}")
return False
def _write_greenscreen_with_nvenc(self, output_path: str, greenscreen_frame: np.ndarray,
frame_count: int, fps: float, bitrate: str) -> bool:
"""Write greenscreen video using NVENC hardware encoding."""
try:
# Setup NVENC encoder
if not self._setup_nvenc_encoder(output_path,
greenscreen_frame.shape[1],
greenscreen_frame.shape[0],
fps, bitrate):
logger.warning("NVENC setup failed for greenscreen, falling back to OpenCV")
return self._write_greenscreen_with_opencv(output_path, greenscreen_frame, frame_count, fps)
# Write identical greenscreen frames
for _ in range(frame_count):
self.nvenc_process.stdin.write(greenscreen_frame.tobytes())
# Finalize encoding
self.nvenc_process.stdin.close()
self.nvenc_process.wait()
if self.nvenc_process.returncode != 0:
logger.error("NVENC encoding failed for greenscreen")
return False
logger.debug(f"Wrote {frame_count} greenscreen frames using NVENC")
return True
except Exception as e:
logger.error(f"Error writing greenscreen with NVENC: {e}")
return False
def has_valid_masks(self, video_segments: Optional[Dict[int, Dict[int, np.ndarray]]]) -> bool:
"""
Check if video segments contain valid masks.
Args:
video_segments: Video segments dictionary from SAM2
Returns:
True if valid masks are found
"""
if not video_segments:
return False
# Check if any frame has non-empty masks
for frame_idx, frame_masks in video_segments.items():
for obj_id, mask in frame_masks.items():
if mask is not None and np.any(mask):
return True
return False
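The `video_segments` structure passed around this module is a nested dictionary, `{frame_idx: {obj_id: mask}}`. A short sketch of the validity check on toy data, written as a hypothetical free function with the same logic:

```python
import numpy as np


def has_valid_masks(video_segments):
    """True if any object mask in any frame has at least one non-zero pixel."""
    if not video_segments:
        return False
    return any(
        mask is not None and bool(np.any(mask))
        for frame_masks in video_segments.values()
        for mask in frame_masks.values()
    )


empty = np.zeros((4, 4), dtype=bool)
person = empty.copy()
person[1:3, 1:3] = True

no_humans = {0: {1: empty}, 1: {1: None}}
one_human = {0: {1: empty}, 1: {1: person}}
print(has_valid_masks(no_humans), has_valid_masks(one_human))
```

A segment for which this returns `False` is the case handled by the full greenscreen fallback above.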


@@ -8,16 +8,20 @@ import cv2
import numpy as np
import torch
import logging
import subprocess
import gc
from typing import Dict, List, Any, Optional, Tuple
from sam2.build_sam import build_sam2_video_predictor
from .eye_processor import EyeProcessor
logger = logging.getLogger(__name__)
class SAM2Processor:
"""Handles SAM2-based video segmentation for human tracking."""
def __init__(self, checkpoint_path: str, config_path: str, vos_optimized: bool = False):
def __init__(self, checkpoint_path: str, config_path: str, vos_optimized: bool = False,
separate_eye_processing: bool = False, eye_overlap_pixels: int = 0,
async_preprocessor=None):
"""
Initialize SAM2 processor.
@@ -25,11 +29,23 @@ class SAM2Processor:
checkpoint_path: Path to SAM2 checkpoint
config_path: Path to SAM2 config file
vos_optimized: Enable VOS optimization for speedup (requires PyTorch 2.5.1+)
separate_eye_processing: Enable VR180 separate eye processing mode
eye_overlap_pixels: Pixel overlap between eyes for blending
async_preprocessor: Optional async preprocessor for background low-res video generation
"""
self.checkpoint_path = checkpoint_path
self.config_path = config_path
self.vos_optimized = vos_optimized
self.separate_eye_processing = separate_eye_processing
self.async_preprocessor = async_preprocessor
self.predictor = None
# Initialize eye processor if separate eye processing is enabled
if separate_eye_processing:
self.eye_processor = EyeProcessor(eye_overlap_pixels=eye_overlap_pixels)
else:
self.eye_processor = None
self._initialize_predictor()
def _initialize_predictor(self):
@@ -108,13 +124,64 @@ class SAM2Processor:
def create_low_res_video(self, input_video_path: str, output_video_path: str, scale: float):
"""
Create a low-resolution version of the input video for inference.
Create a low-resolution version of the input video for inference using FFmpeg
with hardware acceleration for improved performance.
Args:
input_video_path: Path to input video
output_video_path: Path to output low-res video
scale: Scale factor for resolution reduction
"""
try:
# Get video properties using OpenCV
cap = cv2.VideoCapture(input_video_path)
if not cap.isOpened():
raise ValueError(f"Could not open video: {input_video_path}")
original_width = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))
original_height = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))
fps = cap.get(cv2.CAP_PROP_FPS) or 30.0
frame_count = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
cap.release()
target_width = int(original_width * scale)
target_height = int(original_height * scale)
# Ensure dimensions are even, as required by many codecs
target_width = target_width if target_width % 2 == 0 else target_width + 1
target_height = target_height if target_height % 2 == 0 else target_height + 1
# Construct FFmpeg command with hardware acceleration
command = [
'ffmpeg',
'-y',
'-hwaccel', 'auto', # Auto-detect hardware acceleration
'-i', input_video_path,
'-vf', f'scale={target_width}:{target_height}',
'-c:v', 'h264_nvenc', # Use NVIDIA's hardware encoder
'-preset', 'fast',
'-cq', '23', # h264_nvenc has no -crf option; -cq sets its target quality level
output_video_path
]
logger.info(f"Executing FFmpeg command: {' '.join(command)}")
# Execute FFmpeg command
# check=True raises CalledProcessError on a non-zero exit; the handler below then falls back to OpenCV
subprocess.run(command, check=True, capture_output=True, text=True)
logger.info(f"Created low-res video with {frame_count} frames: {output_video_path}")
except (subprocess.CalledProcessError, FileNotFoundError) as e:
logger.warning(f"Hardware-accelerated FFmpeg failed: {e}. Falling back to OpenCV.")
# Fallback to original OpenCV implementation if FFmpeg fails
self._create_low_res_video_opencv(input_video_path, output_video_path, scale)
def _create_low_res_video_opencv(self, input_video_path: str, output_video_path: str, scale: float):
"""Original OpenCV-based implementation for creating low-resolution video."""
cap = cv2.VideoCapture(input_video_path)
if not cap.isOpened():
raise ValueError(f"Could not open video: {input_video_path}")
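The dimension handling in `create_low_res_video` can be tried in isolation. The following is a minimal sketch rather than the repository's code: `scaled_even_dimensions` is a hypothetical helper name, and it only mirrors the round-up-to-even rule shown in the hunk above.

```python
from typing import Tuple


def scaled_even_dimensions(width: int, height: int, scale: float) -> Tuple[int, int]:
    """Scale a frame size, rounding each odd result up to the next even number."""

    def to_even(value: int) -> int:
        # Many encoders (H.264 with 4:2:0 chroma subsampling, for example) reject odd sizes
        return value if value % 2 == 0 else value + 1

    return to_even(int(width * scale)), to_even(int(height * scale))


if __name__ == "__main__":
    # 1918 * 0.5 = 959, which is odd, so the width is bumped to 960
    print(scaled_even_dimensions(1918, 1080, 0.5))  # (960, 540)
```

Rounding up rather than down means the result can differ from the exact scaled size by one pixel, which is negligible for an inference-only copy of the video.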
@@ -139,9 +206,52 @@ class SAM2Processor:
cap.release()
out.release()
logger.info(f"Created low-res video with {frame_count} frames: {output_video_path}")
logger.info(f"Created low-res video with {frame_count} frames using OpenCV: {output_video_path}")
def add_yolo_prompts_to_predictor(self, inference_state, prompts: List[Dict[str, Any]]) -> bool:
def ensure_low_res_video(self, input_video_path: str, output_video_path: str,
scale: float, segment_idx: Optional[int] = None) -> bool:
"""
Ensure low-resolution video exists, using async preprocessor if available.
Args:
input_video_path: Path to input video
output_video_path: Path to output low-res video
scale: Scale factor for resolution reduction
segment_idx: Optional segment index for async coordination
Returns:
True if low-res video is ready
"""
# Check if already exists
if os.path.exists(output_video_path) and os.path.getsize(output_video_path) > 0:
return True
# Use async preprocessor if available and segment index provided
if self.async_preprocessor and segment_idx is not None:
if self.async_preprocessor.is_segment_ready(segment_idx):
if os.path.exists(output_video_path) and os.path.getsize(output_video_path) > 0:
logger.debug(f"Async preprocessor provided segment {segment_idx}")
return True
else:
logger.debug(f"Async preprocessor hasn't completed segment {segment_idx} yet")
# Fallback to synchronous creation
try:
logger.info(f"Creating low-res video synchronously: {input_video_path} -> {output_video_path}")
self.create_low_res_video(input_video_path, output_video_path, scale)
if os.path.exists(output_video_path) and os.path.getsize(output_video_path) > 0:
logger.info(f"Successfully created low-res video: {output_video_path} ({os.path.getsize(output_video_path)} bytes)")
return True
else:
logger.error(f"Low-res video creation failed - file doesn't exist or is empty: {output_video_path}")
return False
except Exception as e:
logger.error(f"Failed to create low-res video {output_video_path}: {e}")
return False
def add_yolo_prompts_to_predictor(self, inference_state, prompts: List[Dict[str, Any]],
inference_scale: float = 1.0) -> bool:
"""
Add YOLO detection prompts to SAM2 predictor.
Includes error handling matching the working spec.md implementation.
@@ -149,6 +259,7 @@ class SAM2Processor:
Args:
inference_state: SAM2 inference state
prompts: List of prompt dictionaries with obj_id and bbox
inference_scale: Scale factor to apply to bounding boxes
Returns:
True if prompts were added successfully
@@ -166,14 +277,20 @@ class SAM2Processor:
bbox = prompt['bbox']
confidence = prompt.get('confidence', 'unknown')
logger.info(f"SAM2 Debug: Adding prompt {i+1}/{len(prompts)}: Object {obj_id}, bbox={bbox}, conf={confidence}")
# Scale bounding box for SAM2 inference resolution
scaled_bbox = bbox * inference_scale
logger.info(f"SAM2 Debug: Adding prompt {i+1}/{len(prompts)}: Object {obj_id}")
logger.info(f" Original bbox: {bbox}")
logger.info(f" Scaled bbox (scale={inference_scale}): {scaled_bbox}")
logger.info(f" Confidence: {confidence}")
try:
_, out_obj_ids, out_mask_logits = self.predictor.add_new_points_or_box(
inference_state=inference_state,
frame_idx=0,
obj_id=obj_id,
box=bbox.astype(np.float32),
box=scaled_bbox.astype(np.float32),
)
logger.info(f"SAM2 Debug: ✓ Successfully added Object {obj_id} - returned obj_ids: {out_obj_ids}")
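Because SAM2 runs on the downscaled video, boxes detected at full resolution have to be mapped into that smaller coordinate space before they are used as prompts. A minimal sketch of that mapping, assuming boxes are `[x1, y1, x2, y2]` in pixels; `scale_bbox` is a hypothetical helper, and plain Python lists stand in for the NumPy arrays used in the pipeline.

```python
from typing import List, Sequence


def scale_bbox(bbox: Sequence[float], inference_scale: float) -> List[float]:
    """Map an [x1, y1, x2, y2] box from full-resolution pixels to the inference frame."""
    if inference_scale <= 0:
        raise ValueError("inference_scale must be positive")
    # Every coordinate shrinks by the same factor as the video itself
    return [coord * inference_scale for coord in bbox]


if __name__ == "__main__":
    # A box found in the full-resolution frame, mapped to a half-resolution copy
    print(scale_bbox([1000, 200, 1600, 1800], 0.5))  # [500.0, 100.0, 800.0, 900.0]
```

The pipeline code writes `bbox * inference_scale` directly, which relies on `bbox` being a NumPy array; the same expression on a plain Python list raises `TypeError`.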
@@ -329,14 +446,11 @@ class SAM2Processor:
logger.info(f"Processing segment {segment_idx} with SAM2")
# Create low-resolution video for inference
# Create low-resolution video for inference (async-aware)
low_res_video_path = os.path.join(segment_dir, "low_res_video.mp4")
if not os.path.exists(low_res_video_path):
try:
self.create_low_res_video(video_path, low_res_video_path, inference_scale)
except Exception as e:
logger.error(f"Failed to create low-res video for segment {segment_idx}: {e}")
return None
if not self.ensure_low_res_video(video_path, low_res_video_path, inference_scale, segment_idx):
logger.error(f"Failed to create low-res video for segment {segment_idx}")
return None
try:
# Initialize inference state
@@ -344,7 +458,7 @@ class SAM2Processor:
# Add prompts or previous masks
if yolo_prompts:
if not self.add_yolo_prompts_to_predictor(inference_state, yolo_prompts):
if not self.add_yolo_prompts_to_predictor(inference_state, yolo_prompts, inference_scale):
return None
elif previous_masks:
if not self.add_previous_masks_to_predictor(inference_state, previous_masks):
@@ -375,13 +489,7 @@ class SAM2Processor:
except Exception as e:
logger.warning(f"Could not remove low-res video: {e}")
# Mark segment as completed (for resume capability)
try:
with open(output_done_file, 'w') as f:
f.write(f"Segment {segment_idx} completed successfully\n")
logger.debug(f"Marked segment {segment_idx} as completed")
except Exception as e:
logger.warning(f"Could not create completion marker: {e}")
return video_segments
@@ -490,7 +598,7 @@ class SAM2Processor:
inference_state = self.predictor.init_state(video_path=temp_video_path, async_loading_frames=True)
# Add prompts
if not self.add_yolo_prompts_to_predictor(inference_state, prompts):
if not self.add_yolo_prompts_to_predictor(inference_state, prompts, inference_scale):
logger.error("Failed to add prompts for first frame debug")
return False
@@ -650,3 +758,250 @@ class SAM2Processor:
else:
logger.error("SAM2 Mid-segment: FAILED - No prompts were successfully added")
return False
def process_single_eye_segment(self, segment_info: dict, eye_side: str,
yolo_prompts: Optional[List[Dict[str, Any]]] = None,
previous_masks: Optional[Dict[int, np.ndarray]] = None,
inference_scale: float = 0.5) -> Optional[Dict[int, np.ndarray]]:
"""
Process a single eye of a VR180 segment with SAM2.
Args:
segment_info: Segment information dictionary
eye_side: 'left' or 'right' eye
yolo_prompts: Optional YOLO detection prompts for first frame
previous_masks: Optional masks from previous segment
inference_scale: Scale factor for inference
Returns:
Dictionary mapping frame indices to masks, or None if failed
"""
if not self.eye_processor:
logger.error("Eye processor not initialized - separate_eye_processing must be enabled")
return None
segment_dir = segment_info['directory']
video_path = segment_info['video_file']
segment_idx = segment_info['index']
logger.info(f"Processing {eye_side} eye for segment {segment_idx}")
# Use the video path directly (it should already be the eye-specific video)
eye_video_path = video_path
# Verify the eye video exists
if not os.path.exists(eye_video_path):
logger.error(f"Eye video not found: {eye_video_path}")
return None
# Create low-resolution eye video for inference (async-aware)
low_res_eye_video_path = os.path.join(segment_dir, f"low_res_{eye_side}_eye_video.mp4")
if not self.ensure_low_res_video(eye_video_path, low_res_eye_video_path, inference_scale, segment_idx):
logger.error(f"Failed to create low-res {eye_side} eye video for segment {segment_idx}")
return None
try:
# Initialize inference state with eye-specific video
inference_state = self.predictor.init_state(video_path=low_res_eye_video_path, async_loading_frames=True)
# Add prompts or previous masks (always use obj_id=1 for single eye processing)
if yolo_prompts:
# Convert prompts to use obj_id=1 for single eye processing
eye_prompts = []
for prompt in yolo_prompts:
eye_prompt = prompt.copy()
eye_prompt['obj_id'] = 1 # Always use obj_id=1 for single eye
eye_prompts.append(eye_prompt)
if not self.add_yolo_prompts_to_predictor(inference_state, eye_prompts, inference_scale):
logger.error(f"Failed to add prompts for {eye_side} eye")
return None
elif previous_masks:
# Convert previous masks to use obj_id=1 for single eye processing
# Only the first previous mask is carried over, remapped to obj_id=1
eye_masks = {1: next(iter(previous_masks.values()))}
if not self.add_previous_masks_to_predictor(inference_state, eye_masks):
logger.error(f"Failed to add previous masks for {eye_side} eye")
return None
else:
logger.error(f"No prompts or previous masks available for {eye_side} eye of segment {segment_idx}")
return None
# Propagate masks
logger.info(f"Propagating masks for {eye_side} eye")
video_segments = self.propagate_masks(inference_state)
# Extract just the masks (remove obj_id structure since we only use obj_id=1)
eye_masks = {}
for frame_idx, frame_masks in video_segments.items():
if 1 in frame_masks: # We always use obj_id=1 for single eye processing
eye_masks[frame_idx] = frame_masks[1]
# Clean up
self.predictor.reset_state(inference_state)
del inference_state
gc.collect()
# Remove temporary low-res video
try:
os.remove(low_res_eye_video_path)
logger.debug(f"Removed low-res {eye_side} eye video: {low_res_eye_video_path}")
except Exception as e:
logger.warning(f"Could not remove low-res {eye_side} eye video: {e}")
logger.info(f"Successfully processed {eye_side} eye with {len(eye_masks)} frames")
return eye_masks
except Exception as e:
logger.error(f"Error processing {eye_side} eye for segment {segment_idx}: {e}")
return None
def process_segment_with_separate_eyes(self, segment_info: dict,
left_prompts: Optional[List[Dict[str, Any]]] = None,
right_prompts: Optional[List[Dict[str, Any]]] = None,
previous_left_masks: Optional[Dict[int, np.ndarray]] = None,
previous_right_masks: Optional[Dict[int, np.ndarray]] = None,
inference_scale: float = 0.5,
full_frame_shape: Optional[Tuple[int, int]] = None) -> Optional[Dict[int, Dict[int, np.ndarray]]]:
"""
Process a VR180 segment with separate left and right eye processing.
Args:
segment_info: Segment information dictionary
left_prompts: Optional YOLO prompts for left eye
right_prompts: Optional YOLO prompts for right eye
previous_left_masks: Optional previous masks for left eye
previous_right_masks: Optional previous masks for right eye
inference_scale: Scale factor for inference
full_frame_shape: Shape of full VR180 frame (height, width)
Returns:
Combined video segments dictionary or None if failed
"""
if not self.eye_processor:
logger.error("Eye processor not initialized - separate_eye_processing must be enabled")
return None
segment_idx = segment_info['index']
logger.info(f"Processing segment {segment_idx} with separate eye processing")
# Get full frame shape if not provided
if full_frame_shape is None:
try:
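cap = cv2.VideoCapture(segment_info['video_file'])
# VideoCapture does not raise on failure, so check explicitly instead of reading a 0x0 shape
if not cap.isOpened():
raise ValueError(f"Could not open video: {segment_info['video_file']}")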
height = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))
width = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))
cap.release()
full_frame_shape = (height, width)
except Exception as e:
logger.error(f"Could not determine frame shape: {e}")
return None
# Process left eye if prompts or previous masks are available
left_masks = None
if left_prompts or previous_left_masks:
logger.info(f"Processing left eye for segment {segment_idx}")
left_masks = self.process_single_eye_segment(
segment_info, 'left', left_prompts, previous_left_masks, inference_scale
)
# Process right eye if prompts or previous masks are available
right_masks = None
if right_prompts or previous_right_masks:
logger.info(f"Processing right eye for segment {segment_idx}")
right_masks = self.process_single_eye_segment(
segment_info, 'right', right_prompts, previous_right_masks, inference_scale
)
# Combine masks back to full frame format
if left_masks or right_masks:
logger.info(f"Combining eye masks for segment {segment_idx}")
combined_masks = self.eye_processor.combine_eye_masks(
left_masks, right_masks, full_frame_shape
)
# Clean up eye-specific videos to save space
try:
left_eye_path = os.path.join(segment_info['directory'], "left_eye_video.mp4")
right_eye_path = os.path.join(segment_info['directory'], "right_eye_video.mp4")
if os.path.exists(left_eye_path):
os.remove(left_eye_path)
logger.debug(f"Removed left eye video: {left_eye_path}")
if os.path.exists(right_eye_path):
os.remove(right_eye_path)
logger.debug(f"Removed right eye video: {right_eye_path}")
except Exception as e:
logger.warning(f"Could not clean up eye videos: {e}")
logger.info(f"Successfully processed segment {segment_idx} with separate eyes")
return combined_masks
else:
logger.warning(f"No masks generated for either eye in segment {segment_idx}")
return None
def create_greenscreen_segment(self, segment_info: dict, green_color: List[int] = [0, 255, 0]) -> bool:
"""
Create a full greenscreen segment when no humans are detected.
Args:
segment_info: Segment information dictionary
green_color: RGB values for green screen color
Returns:
True if greenscreen segment was created successfully
"""
segment_dir = segment_info['directory']
video_path = segment_info['video_file']
segment_idx = segment_info['index']
logger.info(f"Creating full greenscreen segment {segment_idx}")
try:
# Get video properties
cap = cv2.VideoCapture(video_path)
if not cap.isOpened():
logger.error(f"Could not open video: {video_path}")
return False
width = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))
height = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))
fps = cap.get(cv2.CAP_PROP_FPS) or 30.0 # Same fallback as create_low_res_video when FPS is unreported
frame_count = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
cap.release()
# Create output video path
output_video_path = os.path.join(segment_dir, f"output_{segment_idx}.mp4")
# Create greenscreen frames
greenscreen_frame = self.eye_processor.create_full_greenscreen_frame(
(height, width, 3), green_color
)
# Write greenscreen video
fourcc = cv2.VideoWriter_fourcc(*'HEVC')
out = cv2.VideoWriter(output_video_path, fourcc, fps, (width, height))
for _ in range(frame_count):
out.write(greenscreen_frame)
out.release()
# Create mask file (empty/black mask since no humans detected)
mask_output_path = os.path.join(segment_dir, "mask.png")
black_mask = np.zeros((height, width, 3), dtype=np.uint8)
cv2.imwrite(mask_output_path, black_mask)
# Mark segment as completed
output_done_file = os.path.join(segment_dir, "output_frames_done")
with open(output_done_file, 'w') as f:
f.write(f"Greenscreen segment {segment_idx} completed successfully\n")
logger.info(f"Successfully created greenscreen segment {segment_idx}")
return True
except Exception as e:
logger.error(f"Error creating greenscreen segment {segment_idx}: {e}")
return False

306
core/video_assembler.py Normal file

@@ -0,0 +1,306 @@
"""
Video assembler module for concatenating processed segments.
Handles merging processed segments and adding audio from original video.
"""
import os
import subprocess
import logging
from typing import List, Optional
from utils.file_utils import get_segments_directories, file_exists
logger = logging.getLogger(__name__)
class VideoAssembler:
"""Handles final video assembly from processed segments."""
def __init__(self, preserve_audio: bool = True, use_nvenc: bool = False,
output_mode: str = "green_screen"):
"""
Initialize video assembler.
Args:
preserve_audio: Whether to preserve audio from original video
use_nvenc: Whether to use hardware encoding for final output
output_mode: Output mode - "green_screen" or "alpha_channel"
"""
self.preserve_audio = preserve_audio
self.use_nvenc = use_nvenc
self.output_mode = output_mode
def create_concat_file(self, segments_dir: str, output_filename: str = "concat_list.txt") -> Optional[str]:
"""
Create a concatenation file for FFmpeg.
Args:
segments_dir: Directory containing processed segments
output_filename: Name for the concat file
Returns:
Path to concat file or None if no valid segments found
"""
concat_path = os.path.join(segments_dir, output_filename)
valid_segments = 0
try:
segments = get_segments_directories(segments_dir)
with open(concat_path, 'w') as f:
for i, segment in enumerate(segments):
segment_dir = os.path.join(segments_dir, segment)
if self.output_mode == "alpha_channel":
output_video = os.path.join(segment_dir, f"output_{i}.mov")
else:
output_video = os.path.join(segment_dir, f"output_{i}.mp4")
if file_exists(output_video):
# Use relative path for FFmpeg
relative_path = os.path.relpath(output_video, segments_dir)
f.write(f"file '{relative_path}'\n")
valid_segments += 1
else:
logger.warning(f"Output video not found for segment {i}: {output_video}")
if valid_segments == 0:
logger.error("No valid output segments found for concatenation")
os.remove(concat_path)
return None
logger.info(f"Created concatenation file with {valid_segments} segments: {concat_path}")
return concat_path
except Exception as e:
logger.error(f"Error creating concatenation file: {e}")
return None
def concatenate_segments(self, segments_dir: str, output_path: str,
bitrate: str = "50M") -> bool:
"""
Concatenate video segments using FFmpeg.
Args:
segments_dir: Directory containing processed segments
output_path: Path for final concatenated video
bitrate: Output video bitrate
Returns:
True if successful
"""
# Create concatenation file
concat_file = self.create_concat_file(segments_dir)
if not concat_file:
return False
try:
# Build FFmpeg command
# Stream copy serves both output modes: it avoids re-encoding and, in
# alpha_channel mode, keeps the ProRes stream (and its alpha) intact
cmd = [
'ffmpeg',
'-y', # Overwrite output
'-f', 'concat',
'-safe', '0',
'-i', concat_file,
'-c:v', 'copy', # Copy video stream (no re-encoding)
'-an', # No audio for now
output_path
]
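# Use hardware encoding if requested (skipped in alpha_channel mode,
# where re-encoding to yuv420p would discard the alpha channel)
if self.use_nvenc and self.output_mode != "alpha_channel":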
import sys
if sys.platform == 'darwin':
encoder = 'hevc_videotoolbox'
else:
encoder = 'hevc_nvenc'
# Re-encode with hardware acceleration
cmd = [
'ffmpeg',
'-y',
'-f', 'concat',
'-safe', '0',
'-i', concat_file,
'-c:v', encoder,
'-preset', 'slow',
'-b:v', bitrate,
'-pix_fmt', 'yuv420p',
'-an',
output_path
]
logger.info(f"Running concatenation command: {' '.join(cmd)}")
result = subprocess.run(cmd, capture_output=True, text=True)
if result.returncode != 0:
logger.error(f"FFmpeg concatenation failed: {result.stderr}")
return False
logger.info(f"Successfully concatenated segments to: {output_path}")
# Clean up concat file
try:
os.remove(concat_file)
except OSError:
pass
return True
except Exception as e:
logger.error(f"Error during concatenation: {e}")
return False
def copy_audio_from_original(self, original_video: str, processed_video: str,
final_output: str) -> bool:
"""
Copy audio track from original video to processed video.
Args:
original_video: Path to original video with audio
processed_video: Path to processed video without audio
final_output: Path for final output with audio
Returns:
True if successful
"""
if not self.preserve_audio:
logger.info("Audio preservation disabled, skipping audio copy")
return True
try:
# Check if original video has audio
probe_cmd = [
'ffprobe',
'-v', 'error',
'-select_streams', 'a:0',
'-show_entries', 'stream=codec_type',
'-of', 'csv=p=0',
original_video
]
result = subprocess.run(probe_cmd, capture_output=True, text=True)
if result.returncode != 0 or result.stdout.strip() != 'audio':
logger.warning("Original video has no audio track")
# Just copy the processed video
import shutil
shutil.copy2(processed_video, final_output)
return True
# Copy audio from original to processed video
cmd = [
'ffmpeg',
'-y',
'-i', processed_video, # Video input
'-i', original_video, # Audio input
'-c:v', 'copy', # Copy video stream
'-c:a', 'copy', # Copy audio stream
'-map', '0:v:0', # Map video from first input
'-map', '1:a:0', # Map audio from second input
'-shortest', # Match duration to shortest stream
final_output
]
logger.info("Copying audio from original video...")
result = subprocess.run(cmd, capture_output=True, text=True)
if result.returncode != 0:
logger.error(f"FFmpeg audio copy failed: {result.stderr}")
return False
logger.info(f"Successfully added audio to final video: {final_output}")
return True
except Exception as e:
logger.error(f"Error copying audio: {e}")
return False
def assemble_final_video(self, segments_dir: str, original_video: str,
output_path: str, bitrate: str = "50M") -> bool:
"""
Complete pipeline to assemble final video with audio.
Args:
segments_dir: Directory containing processed segments
original_video: Path to original video (for audio)
output_path: Path for final output video
bitrate: Output video bitrate
Returns:
True if successful
"""
logger.info("Starting final video assembly...")
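# Step 1: Concatenate segments (keep the segments' .mov container in alpha_channel mode)
temp_ext = ".mov" if self.output_mode == "alpha_channel" else ".mp4"
temp_concat_path = os.path.join(os.path.dirname(output_path), f"temp_concat{temp_ext}")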
if not self.concatenate_segments(segments_dir, temp_concat_path, bitrate):
logger.error("Failed to concatenate segments")
return False
# Step 2: Add audio from original
if self.preserve_audio and file_exists(original_video):
success = self.copy_audio_from_original(original_video, temp_concat_path, output_path)
# Clean up temp file
try:
os.remove(temp_concat_path)
except OSError:
pass
return success
else:
# No audio to add, just rename temp file
import shutil
try:
shutil.move(temp_concat_path, output_path)
logger.info(f"Final video saved to: {output_path}")
return True
except Exception as e:
logger.error(f"Error moving final video: {e}")
return False
def verify_segment_completeness(self, segments_dir: str) -> tuple[bool, List[int]]:
"""
Verify all segments have been processed.
Args:
segments_dir: Directory containing segments
Returns:
Tuple of (all_complete, missing_segments)
"""
segments = get_segments_directories(segments_dir)
missing_segments = []
for i, segment in enumerate(segments):
segment_dir = os.path.join(segments_dir, segment)
if self.output_mode == "alpha_channel":
output_video = os.path.join(segment_dir, f"output_{i}.mov")
else:
output_video = os.path.join(segment_dir, f"output_{i}.mp4")
if not file_exists(output_video):
missing_segments.append(i)
all_complete = len(missing_segments) == 0
if all_complete:
logger.info(f"All {len(segments)} segments have been processed")
else:
logger.warning(f"Missing output for segments: {missing_segments}")
return all_complete, missing_segments
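The concat list written by `create_concat_file` uses FFmpeg's concat demuxer format, one `file '<path>'` entry per line. The sketch below shows that format on its own, including the quote escaping that the demuxer expects when a path contains a single quote. `write_concat_list` is a hypothetical helper, not part of the repository, and the file names are invented for the demonstration.

```python
import os
import tempfile
from typing import List


def write_concat_list(video_paths: List[str], list_path: str) -> int:
    """Write an FFmpeg concat-demuxer list; return how many entries were written."""
    base_dir = os.path.dirname(os.path.abspath(list_path))
    written = 0
    with open(list_path, "w", encoding="utf-8") as f:
        for path in video_paths:
            if not os.path.isfile(path):
                continue  # skip segments that produced no output
            relative = os.path.relpath(path, base_dir)
            # Inside a single-quoted entry, a literal quote is written as '\''
            escaped = relative.replace("'", "'\\''")
            f.write(f"file '{escaped}'\n")
            written += 1
    return written


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        clip = os.path.join(tmp, "it's_a_clip.mp4")
        open(clip, "wb").close()
        list_path = os.path.join(tmp, "concat_list.txt")
        missing = os.path.join(tmp, "missing.mp4")
        print(write_concat_list([clip, missing], list_path))  # 1
        with open(list_path, encoding="utf-8") as f:
            print(f.read(), end="")
```

Relative entries are resolved against the directory that holds the list file, which is why the paths are computed relative to `list_path` rather than the current working directory.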


@@ -44,6 +44,14 @@ class VideoSplitter:
segments_dir = os.path.join(output_dir, f"{video_name}_segments")
ensure_directory(segments_dir)
# Check for completion marker to avoid re-splitting
completion_marker = os.path.join(segments_dir, ".splitting_done")
if os.path.exists(completion_marker):
logger.info(f"Video already split, skipping splitting process. Found completion marker: {completion_marker}")
segment_dirs = [d for d in os.listdir(segments_dir) if os.path.isdir(os.path.join(segments_dir, d)) and d.startswith("segment_")]
segment_dirs.sort(key=lambda x: int(x.split("_")[1]))
return segments_dir, segment_dirs
logger.info(f"Splitting video {input_video} into {self.segment_duration}s segments")
# Split video using ffmpeg
@@ -83,6 +91,11 @@ class VideoSplitter:
# Create file list for later concatenation
self._create_file_list(segments_dir, segment_dirs)
# Create completion marker
completion_marker = os.path.join(segments_dir, ".splitting_done")
with open(completion_marker, 'w') as f:
f.write("Video splitting completed successfully.")
logger.info(f"Successfully split video into {len(segment_dirs)} segments")
return segments_dir, segment_dirs
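The completion marker added to `VideoSplitter` makes the splitting step idempotent: a second run finds `.splitting_done` and only re-lists the existing segment directories. A minimal sketch of that pattern, with the actual FFmpeg split replaced by directory creation; `split_once` and `list_segment_dirs` are hypothetical names used only for illustration.

```python
import os
import tempfile
from typing import List

MARKER_NAME = ".splitting_done"  # same marker file name as in the hunk above


def list_segment_dirs(segments_dir: str) -> List[str]:
    """Return segment_<n> directory names ordered by their numeric suffix."""
    names = [
        name for name in os.listdir(segments_dir)
        if name.startswith("segment_") and os.path.isdir(os.path.join(segments_dir, name))
    ]
    # A plain string sort would place segment_10 before segment_2
    return sorted(names, key=lambda name: int(name.split("_")[1]))


def split_once(segments_dir: str, segment_count: int) -> bool:
    """Create segment directories unless the marker exists; return True if work was done."""
    marker = os.path.join(segments_dir, MARKER_NAME)
    if os.path.exists(marker):
        return False
    for index in range(segment_count):
        os.makedirs(os.path.join(segments_dir, f"segment_{index}"), exist_ok=True)
    # Write the marker last, so an interrupted run is redone rather than trusted
    with open(marker, "w", encoding="utf-8") as f:
        f.write("Video splitting completed successfully.")
    return True


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        print(split_once(tmp, 11))          # True: first run does the work
        print(split_once(tmp, 11))          # False: marker found, work skipped
        print(list_segment_dirs(tmp)[-1])   # segment_10
```

Writing the marker only after all segments exist is the property that matters: a marker written earlier would let a crashed run look complete.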


@@ -61,26 +61,36 @@ class YOLODetector:
logger.error(f"Failed to load YOLO model: {e}")
raise
def detect_humans_in_frame(self, frame: np.ndarray) -> List[Dict[str, Any]]:
def detect_humans_in_frame(self, frame: np.ndarray, confidence_override: Optional[float] = None,
validate_with_detection: bool = False) -> List[Dict[str, Any]]:
"""
Detect humans in a single frame using YOLO.
Args:
frame: Input frame (BGR format from OpenCV)
confidence_override: Optional confidence to use instead of the default
validate_with_detection: If True and in segmentation mode, validate masks against detection bboxes
Returns:
List of human detection dictionaries with bbox, confidence, and optionally masks
"""
# Run YOLO detection/segmentation
results = self.model(frame, conf=self.confidence_threshold, verbose=False)
confidence = confidence_override if confidence_override is not None else self.confidence_threshold
results = self.model(frame, conf=confidence, verbose=False)
human_detections = []
# Process results
for result in results:
for result_idx, result in enumerate(results):
boxes = result.boxes
masks = result.masks if hasattr(result, 'masks') and result.masks is not None else None
logger.debug(f"YOLO Result {result_idx}: boxes={boxes is not None}, masks={masks is not None}")
if boxes is not None:
logger.debug(f" Found {len(boxes)} total boxes")
if masks is not None:
logger.debug(f" Found {len(masks.data)} total masks")
if boxes is not None:
for i, box in enumerate(boxes):
# Get class ID
@@ -101,18 +111,30 @@ class YOLODetector:
# Extract mask if available (segmentation mode)
if masks is not None and i < len(masks.data):
mask_data = masks.data[i].cpu().numpy() # Get mask for this detection
# Resize the raw mask to match the input frame dimensions
raw_mask = masks.data[i].cpu().numpy()
resized_mask = cv2.resize(raw_mask, (frame.shape[1], frame.shape[0]), interpolation=cv2.INTER_NEAREST)
mask_area = np.sum(resized_mask > 0.5)
detection['has_mask'] = True
detection['mask'] = mask_data
logger.debug(f"YOLO Segmentation: Detected human with mask - conf={conf:.2f}, mask_shape={mask_data.shape}")
detection['mask'] = resized_mask
logger.info(f"YOLO Segmentation: Human {len(human_detections)} - conf={conf:.3f}, raw_mask_shape={raw_mask.shape}, frame_shape={frame.shape}, resized_mask_shape={resized_mask.shape}, mask_area={mask_area}px")
else:
logger.debug(f"YOLO Detection: Detected human with bbox - conf={conf:.2f}, bbox={coords}")
logger.debug(f"YOLO Detection: Human {len(human_detections)} - conf={conf:.3f}, bbox={coords} (no mask)")
human_detections.append(detection)
else:
logger.debug(f"YOLO: Skipping non-human detection (class {cls})")
if self.supports_segmentation:
masks_found = sum(1 for d in human_detections if d['has_mask'])
logger.info(f"YOLO Segmentation: Found {len(human_detections)} humans, {masks_found} with masks")
# Optional validation with detection model
if validate_with_detection and masks_found > 0:
logger.info("Validating segmentation masks with detection model...")
validated_detections = self._validate_masks_with_detection(frame, human_detections, confidence_override)
return validated_detections
else:
logger.debug(f"YOLO Detection: Found {len(human_detections)} humans with bounding boxes")
@@ -732,4 +754,804 @@ class YOLODetector:
except Exception as e:
logger.error(f"Error creating debug frame: {e}")
return False
return False
def detect_humans_in_single_eye(self, frame: np.ndarray, eye_side: str) -> List[Dict[str, Any]]:
"""
Detect humans in a single eye frame (left or right).
Args:
frame: Input eye frame (BGR format)
eye_side: 'left' or 'right' eye
Returns:
List of human detection dictionaries for the single eye
"""
logger.info(f"Running YOLO detection on {eye_side} eye frame")
# Run standard detection on the eye frame
detections = self.detect_humans_in_frame(frame)
logger.info(f"YOLO {eye_side.upper()} Eye: Found {len(detections)} human detections")
for i, detection in enumerate(detections):
bbox = detection['bbox']
conf = detection['confidence']
has_mask = detection.get('has_mask', False)
logger.debug(f"YOLO {eye_side.upper()} Eye Detection {i+1}: bbox={bbox}, conf={conf:.3f}, has_mask={has_mask}")
return detections
def convert_eye_detections_to_sam2_prompts(self, detections: List[Dict[str, Any]],
eye_side: str) -> List[Dict[str, Any]]:
"""
Convert single eye detections to SAM2 prompts (always uses obj_id=1 for single eye processing).
Args:
detections: List of YOLO detection results for single eye
eye_side: 'left' or 'right' eye
Returns:
List of SAM2 prompt dictionaries with obj_id=1 for single eye processing
"""
if not detections:
logger.warning(f"No detections provided for {eye_side} eye SAM2 prompt conversion")
return []
logger.info(f"Converting {len(detections)} {eye_side} eye detections to SAM2 prompts")
prompts = []
# For single eye processing, always use obj_id=1 and take the best detection
best_detection = max(detections, key=lambda x: x['confidence'])
prompts.append({
'obj_id': 1, # Always use obj_id=1 for single eye processing
'bbox': best_detection['bbox'].copy(),
'confidence': best_detection['confidence']
})
logger.info(f"{eye_side.upper()} Eye: Converted best detection (conf={best_detection['confidence']:.3f}) to SAM2 Object 1")
return prompts
def has_any_detections(self, detections_list: List[List[Dict[str, Any]]]) -> bool:
"""
Check if any detections exist in a list of detection lists.
Args:
detections_list: List of detection lists (e.g., [left_detections, right_detections])
Returns:
True if any detections are found
"""
for detections in detections_list:
if detections:
return True
return False
def split_detections_by_eye(self, detections: List[Dict[str, Any]], frame_width: int) -> Tuple[List[Dict[str, Any]], List[Dict[str, Any]]]:
"""
Split VR180 detections into left and right eye detections with coordinate conversion.
Args:
detections: List of full-frame VR180 detections
frame_width: Width of the full VR180 frame
Returns:
Tuple of (left_eye_detections, right_eye_detections) with converted coordinates
"""
half_width = frame_width // 2
left_detections = []
right_detections = []
logger.info(f"Splitting {len(detections)} VR180 detections by eye (frame_width={frame_width}, half_width={half_width})")
for i, detection in enumerate(detections):
bbox = detection['bbox']
center_x = (bbox[0] + bbox[2]) / 2
logger.info(f"Detection {i}: bbox={bbox}, center_x={center_x:.1f}")
# Create a copy with converted coordinates
converted_detection = detection.copy()
converted_bbox = bbox.copy()
if center_x < half_width:
# Left eye detection - coordinates remain the same
# Segmentation mode: the mask is passed through unchanged here;
# cropping to the left half is deferred to eye processing
if detection.get('has_mask', False) and 'mask' in detection:
original_mask = detection['mask']
converted_detection['mask'] = original_mask
logger.info(f"Detection {i}: LEFT eye mask shape: {original_mask.shape}")
left_detections.append(converted_detection)
logger.info(f"Detection {i}: Assigned to LEFT eye, center_x={center_x:.1f} < {half_width}, bbox={bbox}")
else:
# Right eye detection - shift coordinates to start from 0
original_bbox = converted_bbox.copy()
converted_bbox[0] -= half_width # x1
converted_bbox[2] -= half_width # x2
# Ensure coordinates are within bounds
converted_bbox[0] = max(0, converted_bbox[0])
converted_bbox[2] = max(0, min(converted_bbox[2], half_width))
converted_detection['bbox'] = converted_bbox
# Segmentation mode: the mask is NOT cropped or shifted here, so it stays in
# full-frame coordinates while the bbox above is already eye-relative
if detection.get('has_mask', False) and 'mask' in detection:
original_mask = detection['mask']
# TODO: crop to the right half and shift x by half_width (deferred to eye processing for now)
converted_detection['mask'] = original_mask
logger.info(f"Detection {i}: RIGHT eye mask shape: {original_mask.shape}")
right_detections.append(converted_detection)
logger.info(f"Detection {i}: Assigned to RIGHT eye, center_x={center_x:.1f} >= {half_width}, original_bbox={original_bbox}, converted_bbox={converted_bbox}")
logger.info(f"Split result: {len(left_detections)} left eye, {len(right_detections)} right eye detections")
return left_detections, right_detections
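# Illustrative sketch only (not part of this commit): the eye assignment and
# right-eye coordinate shift from split_detections_by_eye, reduced to plain lists.
# `to_eye_coords` is a hypothetical helper name, and the sketch assumes bboxes
# are [x1, y1, x2, y2] in full-frame pixels.

```python
from typing import List, Tuple


def to_eye_coords(bbox: List[float], frame_width: int) -> Tuple[str, List[float]]:
    """Assign a full-frame bbox to an eye and express it in that eye's coordinates."""
    half_width = frame_width // 2
    center_x = (bbox[0] + bbox[2]) / 2
    if center_x < half_width:
        # Left eye: x-coordinates are already relative to the left half.
        return "left", list(bbox)
    # Right eye: shift x by half_width, then clamp into [0, half_width].
    x1 = max(0, bbox[0] - half_width)
    x2 = max(0, min(bbox[2] - half_width, half_width))
    return "right", [x1, bbox[1], x2, bbox[3]]


left_example = to_eye_coords([100, 50, 300, 400], 4096)
right_example = to_eye_coords([2000, 50, 2400, 400], 4096)
```

# In the second call the box straddles the midline (x1=2000 < 2048), so x1 is
# clamped to 0 after the shift. As in the method above, a left-eye box that
# straddles the midline is not clamped on its right edge.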
def save_eye_debug_frames(self, left_frame: np.ndarray, right_frame: np.ndarray,
left_detections: List[Dict[str, Any]], right_detections: List[Dict[str, Any]],
left_output_path: str, right_output_path: str) -> Tuple[bool, bool]:
"""
Save debug frames for both left and right eye detections.
Args:
left_frame: Left eye frame
right_frame: Right eye frame
left_detections: Left eye detections
right_detections: Right eye detections
left_output_path: Output path for left eye debug frame
right_output_path: Output path for right eye debug frame
Returns:
Tuple of (left_success, right_success)
"""
logger.info("Saving eye-specific debug frames")
# Save left eye debug frame (eye-specific version)
left_success = self._save_single_eye_debug_frame(
left_frame, left_detections, left_output_path, "LEFT"
)
# Save right eye debug frame (eye-specific version)
right_success = self._save_single_eye_debug_frame(
right_frame, right_detections, right_output_path, "RIGHT"
)
if left_success:
logger.info(f"Saved left eye debug frame: {left_output_path}")
if right_success:
logger.info(f"Saved right eye debug frame: {right_output_path}")
return left_success, right_success
def _save_single_eye_debug_frame(self, frame: np.ndarray, detections: List[Dict[str, Any]],
output_path: str, eye_side: str) -> bool:
"""
Save a debug frame for a single eye with eye-specific visualizations.
Args:
frame: Single eye frame (BGR format from OpenCV)
detections: List of detection dictionaries for this eye
output_path: Path to save the debug image
eye_side: "LEFT" or "RIGHT"
Returns:
True if saved successfully
"""
try:
debug_frame = frame.copy()
# Draw masks or bounding boxes for each detection
for i, detection in enumerate(detections):
bbox = detection['bbox']
confidence = detection['confidence']
has_mask = detection.get('has_mask', False)
# Extract coordinates
x1, y1, x2, y2 = map(int, bbox)
# Choose color based on confidence (green for high, yellow for medium, red for low)
if confidence >= 0.8:
color = (0, 255, 0) # Green
elif confidence >= 0.6:
color = (0, 255, 255) # Yellow
else:
color = (0, 0, 255) # Red
if has_mask and 'mask' in detection:
# Draw segmentation mask
mask = detection['mask']
# Resize mask to match frame if needed
if mask.shape != debug_frame.shape[:2]:
mask = cv2.resize(mask.astype(np.float32), (debug_frame.shape[1], debug_frame.shape[0]), interpolation=cv2.INTER_NEAREST)
mask = mask > 0.5
mask = mask.astype(bool)
# Apply colored overlay with transparency
overlay = debug_frame.copy()
overlay[mask] = color
cv2.addWeighted(overlay, 0.3, debug_frame, 0.7, 0, debug_frame)
# Draw mask outline
contours, _ = cv2.findContours(mask.astype(np.uint8), cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
cv2.drawContours(debug_frame, contours, -1, color, 2)
# Prepare label text for segmentation
label = f"Person {i+1}: {confidence:.2f} (MASK)"
else:
# Draw bounding box (detection mode or no mask available)
cv2.rectangle(debug_frame, (x1, y1), (x2, y2), color, 2)
# Prepare label text for detection
label = f"Person {i+1}: {confidence:.2f} (BBOX)"
label_size = cv2.getTextSize(label, cv2.FONT_HERSHEY_SIMPLEX, 0.6, 2)[0]
# Draw label background
cv2.rectangle(debug_frame,
(x1, y1 - label_size[1] - 10),
(x1 + label_size[0], y1),
color, -1)
# Draw label text
cv2.putText(debug_frame, label,
(x1, y1 - 5),
cv2.FONT_HERSHEY_SIMPLEX, 0.6,
(255, 255, 255), 2)
# Add title specific to this eye
frame_height, frame_width = debug_frame.shape[:2]
title = f"{eye_side} EYE: {len(detections)} detections"
cv2.putText(debug_frame, title, (10, 30), cv2.FONT_HERSHEY_SIMPLEX, 1.0, (255, 255, 255), 2)
# Add mode information
mode_text = f"YOLO Mode: {self.mode.upper()}"
masks_available = sum(1 for d in detections if d.get('has_mask', False))
if self.supports_segmentation and masks_available > 0:
summary = f"{len(detections)} detections → {masks_available} MASKS"
else:
summary = f"{len(detections)} detections → BOUNDING BOXES"
cv2.putText(debug_frame, mode_text,
(10, 60),
cv2.FONT_HERSHEY_SIMPLEX, 0.8,
(0, 255, 255), 2) # Yellow for mode
cv2.putText(debug_frame, summary,
(10, 90),
cv2.FONT_HERSHEY_SIMPLEX, 0.8,
(255, 255, 255), 2)
# Add frame dimensions info
dims_info = f"Frame: {frame_width}x{frame_height}"
cv2.putText(debug_frame, dims_info,
(10, 120),
cv2.FONT_HERSHEY_SIMPLEX, 0.6,
(255, 255, 255), 2)
# Save debug frame
success = cv2.imwrite(output_path, debug_frame)
if success:
logger.info(f"Saved {eye_side} eye debug frame to {output_path}")
else:
logger.error(f"Failed to save {eye_side} eye debug frame to {output_path}")
return success
except Exception as e:
logger.error(f"Error creating {eye_side} eye debug frame: {e}")
return False
def _calculate_iou(self, mask1: np.ndarray, mask2: np.ndarray) -> float:
"""Calculate Intersection over Union for two masks of the same size."""
if mask1.shape != mask2.shape:
return 0.0
intersection = np.logical_and(mask1, mask2).sum()
union = np.logical_or(mask1, mask2).sum()
return intersection / union if union > 0 else 0.0
def _calculate_stereo_similarity(self, left_mask: np.ndarray, right_mask: np.ndarray,
left_bbox: np.ndarray, right_bbox: np.ndarray,
left_idx: int = -1, right_idx: int = -1) -> float:
"""
Calculate stereo similarity for VR180 masks using spatial and size features.
For VR180, left and right eye views won't overlap much, so we use other metrics.
"""
logger.info(f" Starting similarity calculation L{left_idx} vs R{right_idx}")
logger.info(f" Left mask: shape={left_mask.shape}, dtype={left_mask.dtype}, min={left_mask.min()}, max={left_mask.max()}")
logger.info(f" Right mask: shape={right_mask.shape}, dtype={right_mask.dtype}, min={right_mask.min()}, max={right_mask.max()}")
logger.info(f" Left bbox: {left_bbox}")
logger.info(f" Right bbox: {right_bbox}")
if left_mask.shape != right_mask.shape:
logger.info(f" L{left_idx} vs R{right_idx}: Shape mismatch - {left_mask.shape} vs {right_mask.shape} - attempting to resize")
# Try to resize the smaller mask to match the larger one
if left_mask.size < right_mask.size:
left_mask = cv2.resize(left_mask.astype(np.float32), (right_mask.shape[1], right_mask.shape[0]), interpolation=cv2.INTER_NEAREST)
left_mask = left_mask > 0.5
logger.info(f" Resized left mask to {left_mask.shape}")
else:
right_mask = cv2.resize(right_mask.astype(np.float32), (left_mask.shape[1], left_mask.shape[0]), interpolation=cv2.INTER_NEAREST)
right_mask = right_mask > 0.5
logger.info(f" Resized right mask to {right_mask.shape}")
if left_mask.shape != right_mask.shape:
logger.warning(f" L{left_idx} vs R{right_idx}: Still shape mismatch after resize - {left_mask.shape} vs {right_mask.shape}")
return 0.0
# 1. Size similarity (area ratio)
left_area = np.sum(left_mask)
right_area = np.sum(right_mask)
if left_area == 0 or right_area == 0:
logger.debug(f" L{left_idx} vs R{right_idx}: Zero area - left={left_area}, right={right_area}")
return 0.0
area_ratio = min(left_area, right_area) / max(left_area, right_area)
# 2. Vertical position similarity (y-coordinates should be similar)
left_center_y = (left_bbox[1] + left_bbox[3]) / 2
right_center_y = (right_bbox[1] + right_bbox[3]) / 2
height = left_mask.shape[0]
y_diff = abs(left_center_y - right_center_y) / height
y_similarity = max(0, 1.0 - y_diff * 2) # Penalize vertical misalignment
# 3. Height similarity (bounding box heights should be similar)
left_height = left_bbox[3] - left_bbox[1]
right_height = right_bbox[3] - right_bbox[1]
if left_height == 0 or right_height == 0:
height_ratio = 0.0
else:
height_ratio = min(left_height, right_height) / max(left_height, right_height)
# 4. Aspect ratio similarity
left_width = left_bbox[2] - left_bbox[0]
right_width = right_bbox[2] - right_bbox[0]
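if left_width == 0 or right_width == 0 or left_height == 0 or right_height == 0:
# Define the aspect ratios here too: the log statement below formats them
# unconditionally and would otherwise raise NameError on a degenerate bbox
left_aspect = right_aspect = 0.0
aspect_similarity = 0.0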
else:
left_aspect = left_width / left_height
right_aspect = right_width / right_height
aspect_diff = abs(left_aspect - right_aspect) / max(left_aspect, right_aspect)
aspect_similarity = max(0, 1.0 - aspect_diff)
# Combine metrics with weights
similarity = (
area_ratio * 0.3 + # 30% weight on size similarity
y_similarity * 0.4 + # 40% weight on vertical alignment
height_ratio * 0.2 + # 20% weight on height similarity
aspect_similarity * 0.1 # 10% weight on aspect ratio
)
# Detailed logging for each comparison
logger.info(f" L{left_idx} vs R{right_idx}: area_ratio={area_ratio:.3f} (L={left_area}px, R={right_area}px), "
f"y_sim={y_similarity:.3f} (L_y={left_center_y:.1f}, R_y={right_center_y:.1f}, diff={y_diff:.3f}), "
f"height_ratio={height_ratio:.3f} (L_h={left_height:.1f}, R_h={right_height:.1f}), "
f"aspect_sim={aspect_similarity:.3f} (L_asp={left_aspect:.2f}, R_asp={right_aspect:.2f}), "
f"FINAL_SIMILARITY={similarity:.3f}")
return similarity
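# Illustrative sketch only (not part of this commit): the weighted score from
# _calculate_stereo_similarity without the mask resizing and logging. The names
# `ratio` and `stereo_similarity` are hypothetical; mask areas are passed in as
# numbers, and bboxes are assumed to be [x1, y1, x2, y2] in eye coordinates.

```python
from typing import Sequence


def ratio(a: float, b: float) -> float:
    """min/max ratio in [0, 1]; 0.0 when either value is not positive."""
    return min(a, b) / max(a, b) if a > 0 and b > 0 else 0.0


def stereo_similarity(left_bbox: Sequence[float], right_bbox: Sequence[float],
                      left_area: float, right_area: float,
                      frame_height: int) -> float:
    """Combine size, vertical alignment, height and aspect ratio into one score."""
    if left_area == 0 or right_area == 0:
        return 0.0  # An empty mask cannot be matched.

    area_ratio = ratio(left_area, right_area)

    # Vertical centers should line up between the two eyes.
    left_cy = (left_bbox[1] + left_bbox[3]) / 2
    right_cy = (right_bbox[1] + right_bbox[3]) / 2
    y_similarity = max(0.0, 1.0 - 2 * abs(left_cy - right_cy) / frame_height)

    left_w, left_h = left_bbox[2] - left_bbox[0], left_bbox[3] - left_bbox[1]
    right_w, right_h = right_bbox[2] - right_bbox[0], right_bbox[3] - right_bbox[1]
    height_ratio = ratio(left_h, right_h)

    if min(left_w, left_h, right_w, right_h) > 0:
        left_aspect, right_aspect = left_w / left_h, right_w / right_h
        aspect_similarity = max(
            0.0, 1.0 - abs(left_aspect - right_aspect) / max(left_aspect, right_aspect))
    else:
        aspect_similarity = 0.0

    # Same weights as the method above: 30% / 40% / 20% / 10%.
    return (0.3 * area_ratio + 0.4 * y_similarity
            + 0.2 * height_ratio + 0.1 * aspect_similarity)


score = stereo_similarity([0, 0, 100, 200], [0, 100, 100, 300], 5000, 5000, 1000)
```

# With identical sizes and a 100 px vertical offset in a 1000 px frame, only the
# y term drops (to 0.8), so the combined score comes out near 0.92.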
def _find_matching_mask_pairs(self, left_masks: List[Dict[str, Any]], right_masks: List[Dict[str, Any]],
similarity_threshold: float) -> Tuple[List[Dict[str, Any]], List[Dict[str, Any]], List[Dict[str, Any]]]:
"""Find the best matching pairs of masks between left and right eyes using stereo similarity."""
logger.info(f"Starting stereo mask matching with {len(left_masks)} left masks and {len(right_masks)} right masks.")
if not left_masks or not right_masks:
return [], left_masks, right_masks
# 1. Calculate all similarity scores for every possible pair
possible_pairs = []
logger.info("--- Calculating all possible stereo similarity pairs ---")
# First, log details about each mask
logger.info(f"LEFT EYE MASKS ({len(left_masks)} total):")
for i, left_detection in enumerate(left_masks):
bbox = left_detection['bbox']
mask_area = np.sum(left_detection['mask'])
conf = left_detection['confidence']
logger.info(f" L{i}: bbox=[{bbox[0]:.1f},{bbox[1]:.1f},{bbox[2]:.1f},{bbox[3]:.1f}], area={mask_area}px, conf={conf:.3f}")
logger.info(f"RIGHT EYE MASKS ({len(right_masks)} total):")
for j, right_detection in enumerate(right_masks):
bbox = right_detection['bbox']
mask_area = np.sum(right_detection['mask'])
conf = right_detection['confidence']
logger.info(f" R{j}: bbox=[{bbox[0]:.1f},{bbox[1]:.1f},{bbox[2]:.1f},{bbox[3]:.1f}], area={mask_area}px, conf={conf:.3f}")
logger.info("--- Stereo Similarity Calculations ---")
for i, left_detection in enumerate(left_masks):
for j, right_detection in enumerate(right_masks):
try:
# Use stereo similarity instead of IOU for VR180
similarity = self._calculate_stereo_similarity(
left_detection['mask'], right_detection['mask'],
left_detection['bbox'], right_detection['bbox'],
left_idx=i, right_idx=j
)
if similarity > similarity_threshold:
possible_pairs.append({'left_idx': i, 'right_idx': j, 'similarity': similarity})
logger.info(f" ✓ L{i} vs R{j}: ABOVE THRESHOLD ({similarity:.4f} > {similarity_threshold:.4f})")
else:
logger.info(f" ✗ L{i} vs R{j}: BELOW THRESHOLD ({similarity:.4f} <= {similarity_threshold:.4f})")
except Exception as e:
logger.error(f" ERROR L{i} vs R{j}: Exception in similarity calculation: {e}")
similarity = 0.0
# 2. Sort pairs by similarity score in descending order to prioritize the best matches
possible_pairs.sort(key=lambda x: x['similarity'], reverse=True)
logger.debug("--- Sorted similarity pairs above threshold ---")
for pair in possible_pairs:
logger.debug(f" Pair (L{pair['left_idx']}, R{pair['right_idx']}) - Similarity: {pair['similarity']:.4f}")
matched_pairs = []
matched_left_indices = set()
matched_right_indices = set()
# 3. Iterate through sorted pairs and greedily select the best available ones
logger.debug("--- Selecting best pairs ---")
for pair in possible_pairs:
left_idx, right_idx = pair['left_idx'], pair['right_idx']
if left_idx not in matched_left_indices and right_idx not in matched_right_indices:
logger.info(f" MATCH FOUND: (L{left_idx}, R{right_idx}) with Similarity {pair['similarity']:.4f}")
matched_pairs.append({
'left_mask': left_masks[left_idx],
'right_mask': right_masks[right_idx],
'similarity': pair['similarity'] # Changed from 'iou' to 'similarity'
})
matched_left_indices.add(left_idx)
matched_right_indices.add(right_idx)
else:
logger.debug(f" Skipping pair (L{left_idx}, R{right_idx}) because one mask is already matched.")
# 4. Identify unmatched (orphan) masks
unmatched_left = [mask for i, mask in enumerate(left_masks) if i not in matched_left_indices]
unmatched_right = [mask for i, mask in enumerate(right_masks) if i not in matched_right_indices]
logger.info(f"Matching complete: Found {len(matched_pairs)} pairs. Left orphans: {len(unmatched_left)}, Right orphans: {len(unmatched_right)}.")
return matched_pairs, unmatched_left, unmatched_right
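# Illustrative sketch only (not part of this commit): the greedy selection in
# _find_matching_mask_pairs, with the scores supplied as a dict instead of being
# computed from masks. `greedy_match` and the score values are hypothetical.

```python
from typing import Dict, List, Set, Tuple


def greedy_match(scores: Dict[Tuple[int, int], float],
                 threshold: float) -> List[Tuple[int, int, float]]:
    """Pick (left, right) pairs best-first; each index is used at most once."""
    candidates = sorted(
        ((score, left, right)
         for (left, right), score in scores.items() if score > threshold),
        key=lambda item: item[0],
        reverse=True,
    )
    used_left: Set[int] = set()
    used_right: Set[int] = set()
    matches: List[Tuple[int, int, float]] = []
    for score, left, right in candidates:
        if left not in used_left and right not in used_right:
            matches.append((left, right, score))
            used_left.add(left)
            used_right.add(right)
    return matches


scores = {(0, 0): 0.9, (0, 1): 0.8, (1, 0): 0.85, (1, 1): 0.3}
matches = greedy_match(scores, threshold=0.5)
```

# Here only (L0, R0) is matched: taking the 0.9 pair first blocks both remaining
# candidates, even though (L0, R1) + (L1, R0) would pair every mask. Greedy
# selection does not maximize total similarity; if that ever matters, an optimal
# assignment solver (for example the Hungarian algorithm) is one possible alternative.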
def _save_stereo_agreement_debug_frame(self, left_frame: np.ndarray, right_frame: np.ndarray,
left_detections: List[Dict[str, Any]], right_detections: List[Dict[str, Any]],
matched_pairs: List[Dict[str, Any]], unmatched_left: List[Dict[str, Any]],
unmatched_right: List[Dict[str, Any]], output_path: str, title: str):
"""Save a debug frame visualizing the stereo mask agreement process."""
try:
# Create a combined image
h, w, _ = left_frame.shape
combined_frame = np.hstack((left_frame, right_frame))
def get_centroid(mask):
m = cv2.moments(mask.astype(np.uint8), binaryImage=True)
return (int(m["m10"] / m["m00"]), int(m["m01"] / m["m00"])) if m["m00"] != 0 else (0,0)
def draw_label(frame, text, pos, color):
# Draw a black background rectangle
cv2.rectangle(frame, (pos[0], pos[1] - 14), (pos[0] + len(text) * 8, pos[1] + 5), (0,0,0), -1)
# Draw the text
cv2.putText(frame, text, pos, cv2.FONT_HERSHEY_SIMPLEX, 0.5, color, 1)
# --- Draw ALL Masks First (to ensure every mask gets a label) ---
logger.info(f"Debug Frame: Drawing {len(left_detections)} left masks and {len(right_detections)} right masks")
# Draw all left detections first
for i, detection in enumerate(left_detections):
mask = detection['mask']
mask_area = np.sum(mask > 0.5)
# Skip tiny masks that are likely noise
if mask_area < 100: # Less than 100 pixels
logger.debug(f"Skipping tiny left mask L{i} with area {mask_area}px")
continue
contours, _ = cv2.findContours(mask.astype(np.uint8), cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
if contours:
cv2.drawContours(combined_frame, contours, -1, (0, 0, 255), 2) # Default red for unmatched
c = get_centroid(mask)
if c[0] > 0 and c[1] > 0: # Valid centroid
draw_label(combined_frame, f"L{i}", c, (0, 0, 255))
logger.debug(f"Drew left mask L{i} at centroid {c}, area={mask_area}px")
# Draw all right detections
for i, detection in enumerate(right_detections):
mask = detection['mask']
mask_area = np.sum(mask > 0.5)
# Skip tiny masks that are likely noise
if mask_area < 100: # Less than 100 pixels
logger.debug(f"Skipping tiny right mask R{i} with area {mask_area}px")
continue
contours, _ = cv2.findContours(mask.astype(np.uint8), cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
if contours:
for cnt in contours:
cnt[:, :, 0] += w
cv2.drawContours(combined_frame, contours, -1, (0, 0, 255), 2) # Default red for unmatched
c_shifted = get_centroid(mask)
c = (c_shifted[0] + w, c_shifted[1])
if c[0] > w and c[1] > 0: # Valid centroid in right half
draw_label(combined_frame, f"R{i}", c, (0, 0, 255))
logger.debug(f"Drew right mask R{i} at centroid {c}, area={mask_area}px")
# --- Now Overdraw Matched Pairs in Green ---
for pair in matched_pairs:
left_mask = pair['left_mask']['mask']
right_mask = pair['right_mask']['mask']
# Find the indices from the stored pair data (should be available from matching)
left_idx = None
right_idx = None
# Find indices by comparing mask properties
for i, det in enumerate(left_detections):
if (np.array_equal(det['bbox'], pair['left_mask']['bbox']) and
abs(det['confidence'] - pair['left_mask']['confidence']) < 0.001):
left_idx = i
break
for i, det in enumerate(right_detections):
if (np.array_equal(det['bbox'], pair['right_mask']['bbox']) and
abs(det['confidence'] - pair['right_mask']['confidence']) < 0.001):
right_idx = i
break
# Draw left mask in green (matched)
contours, _ = cv2.findContours(left_mask.astype(np.uint8), cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
if contours:
cv2.drawContours(combined_frame, contours, -1, (0, 255, 0), 3) # Thicker green line
c1 = get_centroid(left_mask)
if c1[0] > 0 and c1[1] > 0:
draw_label(combined_frame, f"L{left_idx if left_idx is not None else '?'}", c1, (0, 255, 0))
# Draw right mask in green (matched)
contours, _ = cv2.findContours(right_mask.astype(np.uint8), cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
if contours:
for cnt in contours:
cnt[:, :, 0] += w
cv2.drawContours(combined_frame, contours, -1, (0, 255, 0), 3) # Thicker green line
c2_shifted = get_centroid(right_mask)
c2 = (c2_shifted[0] + w, c2_shifted[1])
if c2[0] > w and c2[1] > 0:
draw_label(combined_frame, f"R{right_idx if right_idx is not None else '?'}", c2, (0, 255, 0))
# Draw line connecting centroids and similarity score
cv2.line(combined_frame, c1, c2, (0, 255, 0), 2)
similarity_text = f"Sim: {pair.get('similarity', pair.get('iou', 0)):.2f}"
cv2.putText(combined_frame, similarity_text, (c1[0] + 10, c1[1] + 20), cv2.FONT_HERSHEY_SIMPLEX, 0.6, (255, 255, 255), 2)
# Add title
cv2.putText(combined_frame, title, (10, 30), cv2.FONT_HERSHEY_SIMPLEX, 1.0, (255, 255, 255), 2)
cv2.imwrite(output_path, combined_frame)
logger.info(f"Saved stereo agreement debug frame to {output_path}")
except Exception as e:
logger.error(f"Failed to create stereo agreement debug frame: {e}")
def detect_and_match_stereo_pairs(self, frame: np.ndarray, confidence_reduction_factor: float,
stereo_similarity_threshold: float, segment_info: dict, save_debug_frames: bool) -> List[Dict[str, Any]]:
"""The main method to detect and match stereo mask pairs."""
frame_height, frame_width, _ = frame.shape
half_width = frame_width // 2
left_eye_frame = frame[:, :half_width]
right_eye_frame = frame[:, half_width:half_width*2] # Ensure exact same width
logger.info(f"VR180 Frame Split: Original={frame.shape}, Left={left_eye_frame.shape}, Right={right_eye_frame.shape}")
# Initial detection with validation
logger.info(f"Running initial stereo detection at {self.confidence_threshold} confidence.")
left_detections = self.detect_humans_in_frame(left_eye_frame, validate_with_detection=True)
right_detections = self.detect_humans_in_frame(right_eye_frame, validate_with_detection=True)
# Derive the matching threshold from the configured (IoU-style) value:
# scale by 0.6 with a floor of 0.2, so a configured 0.5 becomes 0.3
similarity_threshold = max(0.2, stereo_similarity_threshold * 0.6)
matched_pairs, unmatched_left, unmatched_right = self._find_matching_mask_pairs(left_detections, right_detections, similarity_threshold)
if save_debug_frames:
debug_path = os.path.join(segment_info['directory'], "yolo_stereo_agreement_initial.jpg")
title = f"Initial Attempt (Conf: {self.confidence_threshold:.2f}) - {len(matched_pairs)} Pairs"
self._save_stereo_agreement_debug_frame(left_eye_frame, right_eye_frame, left_detections, right_detections, matched_pairs, unmatched_left, unmatched_right, debug_path, title)
# Retry with lower confidence if no pairs found
if not matched_pairs:
new_confidence = self.confidence_threshold * confidence_reduction_factor
logger.info(f"No valid pairs found. Reducing confidence to {new_confidence:.2f} and retrying.")
left_detections = self.detect_humans_in_frame(left_eye_frame, confidence_override=new_confidence, validate_with_detection=True)
right_detections = self.detect_humans_in_frame(right_eye_frame, confidence_override=new_confidence, validate_with_detection=True)
matched_pairs, unmatched_left, unmatched_right = self._find_matching_mask_pairs(left_detections, right_detections, similarity_threshold)
if save_debug_frames:
debug_path = os.path.join(segment_info['directory'], "yolo_stereo_agreement_retry.jpg")
title = f"Retry Attempt (Conf: {new_confidence:.2f}) - {len(matched_pairs)} Pairs"
self._save_stereo_agreement_debug_frame(left_eye_frame, right_eye_frame, left_detections, right_detections, matched_pairs, unmatched_left, unmatched_right, debug_path, title)
# Prepare final results - convert to full-frame coordinates and masks
final_prompts = []
if matched_pairs:
logger.info(f"Found {len(matched_pairs)} valid stereo pairs.")
for i, pair in enumerate(matched_pairs):
# Convert eye-specific coordinates and masks to full-frame
left_bbox_full_frame, left_mask_full_frame = self._convert_eye_to_full_frame(
pair['left_mask']['bbox'], pair['left_mask']['mask'],
'left', frame_width, frame_height
)
right_bbox_full_frame, right_mask_full_frame = self._convert_eye_to_full_frame(
pair['right_mask']['bbox'], pair['right_mask']['mask'],
'right', frame_width, frame_height
)
logger.info(f"Stereo Pair {i}: Left bbox {pair['left_mask']['bbox']} -> {left_bbox_full_frame}")
logger.info(f"Stereo Pair {i}: Right bbox {pair['right_mask']['bbox']} -> {right_bbox_full_frame}")
# Create prompts for SAM2 with full-frame coordinates and masks
final_prompts.append({
'obj_id': i * 2 + 1,
'bbox': left_bbox_full_frame,
'mask': left_mask_full_frame
})
final_prompts.append({
'obj_id': i * 2 + 2,
'bbox': right_bbox_full_frame,
'mask': right_mask_full_frame
})
else:
logger.warning("No valid stereo pairs found after all attempts.")
return final_prompts
def _convert_eye_to_full_frame(self, eye_bbox: np.ndarray, eye_mask: np.ndarray,
eye_side: str, full_frame_width: int, full_frame_height: int) -> tuple:
"""
Convert eye-specific bounding box and mask to full-frame coordinates.
Args:
eye_bbox: Bounding box in eye coordinate system
eye_mask: Mask in eye coordinate system
eye_side: 'left' or 'right'
full_frame_width: Width of the full VR180 frame
full_frame_height: Height of the full VR180 frame
Returns:
Tuple of (full_frame_bbox, full_frame_mask)
"""
half_width = full_frame_width // 2
# Convert bounding box coordinates
full_frame_bbox = eye_bbox.copy()
if eye_side == 'right':
# Shift right eye coordinates by half_width
full_frame_bbox[0] += half_width # x1
full_frame_bbox[2] += half_width # x2
# Create full-frame mask
full_frame_mask = np.zeros((full_frame_height, full_frame_width), dtype=eye_mask.dtype)
if eye_side == 'left':
# Place left eye mask in left half
eye_height, eye_width = eye_mask.shape
target_height = min(eye_height, full_frame_height)
target_width = min(eye_width, half_width)
full_frame_mask[:target_height, :target_width] = eye_mask[:target_height, :target_width]
else: # right
# Place right eye mask in right half
eye_height, eye_width = eye_mask.shape
target_height = min(eye_height, full_frame_height)
target_width = min(eye_width, half_width)
full_frame_mask[:target_height, half_width:half_width+target_width] = eye_mask[:target_height, :target_width]
logger.debug(f"Converted {eye_side} eye: bbox {eye_bbox} -> {full_frame_bbox}, "
f"mask {eye_mask.shape} -> {full_frame_mask.shape}, "
f"mask_pixels: {np.sum(eye_mask > 0.5)} -> {np.sum(full_frame_mask > 0.5)}")
return full_frame_bbox, full_frame_mask
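# Illustrative sketch only (not part of this commit): the mask placement done by
# _convert_eye_to_full_frame, with both branches folded into a single x offset.
# `place_eye_mask` is a hypothetical name; NumPy is assumed to be installed, as
# it is for the rest of this file.

```python
import numpy as np


def place_eye_mask(eye_mask: np.ndarray, eye_side: str,
                   full_width: int, full_height: int) -> np.ndarray:
    """Copy a single-eye mask into the matching half of an empty full-frame mask."""
    half_width = full_width // 2
    full_mask = np.zeros((full_height, full_width), dtype=eye_mask.dtype)
    # Crop (never resize) so the copy cannot spill outside its half.
    height = min(eye_mask.shape[0], full_height)
    width = min(eye_mask.shape[1], half_width)
    x_offset = half_width if eye_side == "right" else 0
    full_mask[:height, x_offset:x_offset + width] = eye_mask[:height, :width]
    return full_mask


eye = np.ones((4, 3), dtype=bool)
full = place_eye_mask(eye, "right", full_width=6, full_height=4)
```

# Because the mask is cropped rather than resized, an eye mask produced at a
# different scale than the full frame would be truncated or only partly fill
# its half; callers are assumed to pass masks at matching resolution.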
def _validate_masks_with_detection(self, frame: np.ndarray, segmentation_detections: List[Dict[str, Any]],
confidence_override: Optional[float] = None) -> List[Dict[str, Any]]:
"""
Validate segmentation masks by checking if they overlap with detection bounding boxes.
This helps filter out spurious mask regions that aren't actually humans.
"""
if not hasattr(self, '_detection_model'):
# Load detection model for validation
try:
detection_model_path = self.model_path.replace('-seg.pt', '.pt') # Try to find detection version
if not os.path.exists(detection_model_path):
detection_model_path = "yolo11l.pt" # Fallback to default
logger.info(f"Loading detection model for validation: {detection_model_path}")
self._detection_model = YOLO(detection_model_path)
except Exception as e:
logger.warning(f"Could not load detection model for validation: {e}")
return segmentation_detections
# Run detection model
confidence = confidence_override if confidence_override is not None else self.confidence_threshold
detection_results = self._detection_model(frame, conf=confidence, verbose=False)
# Extract detection bounding boxes
detection_bboxes = []
for result in detection_results:
if result.boxes is not None:
for box in result.boxes:
cls = int(box.cls.cpu().numpy()[0])
if cls == self.human_class_id:
coords = box.xyxy[0].cpu().numpy()
conf = float(box.conf.cpu().numpy()[0])
detection_bboxes.append({'bbox': coords, 'confidence': conf})
logger.info(f"Validation: Found {len(detection_bboxes)} detection bboxes vs {len(segmentation_detections)} segmentation masks")
# Validate each segmentation mask against detection bboxes
validated_detections = []
for seg_det in segmentation_detections:
if not seg_det.get('has_mask', False):
validated_detections.append(seg_det)
continue
# Check if this mask overlaps significantly with any detection bbox
mask = seg_det['mask']
seg_bbox = seg_det['bbox']
best_overlap = 0.0
best_detection = None
for det_bbox_info in detection_bboxes:
det_bbox = det_bbox_info['bbox']
overlap = self._calculate_bbox_overlap(seg_bbox, det_bbox)
if overlap > best_overlap:
best_overlap = overlap
best_detection = det_bbox_info
if best_overlap > 0.3: # 30% overlap threshold
logger.info(f"Validation: Segmentation mask validated (overlap={best_overlap:.3f} with detection conf={best_detection['confidence']:.3f})")
validated_detections.append(seg_det)
else:
mask_area = np.sum(mask > 0.5)
logger.warning(f"Validation: Rejecting segmentation mask with low overlap ({best_overlap:.3f}) - area={mask_area}px")
logger.info(f"Validation: Kept {len(validated_detections)}/{len(segmentation_detections)} segmentation masks")
return validated_detections
def _calculate_bbox_overlap(self, bbox1: np.ndarray, bbox2: np.ndarray) -> float:
"""Calculate the overlap ratio between two bounding boxes."""
# Calculate intersection
x1 = max(bbox1[0], bbox2[0])
y1 = max(bbox1[1], bbox2[1])
x2 = min(bbox1[2], bbox2[2])
y2 = min(bbox1[3], bbox2[3])
if x2 <= x1 or y2 <= y1:
return 0.0
intersection = (x2 - x1) * (y2 - y1)
# Calculate areas
area1 = (bbox1[2] - bbox1[0]) * (bbox1[3] - bbox1[1])
area2 = (bbox2[2] - bbox2[0]) * (bbox2[3] - bbox2[1])
# Return intersection over smaller area (more lenient than IoU)
return intersection / min(area1, area2) if min(area1, area2) > 0 else 0.0
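# Illustrative sketch only (not part of this commit): why _calculate_bbox_overlap
# divides by the smaller area instead of the union. Both function names below
# are hypothetical; bboxes are assumed to be [x1, y1, x2, y2].

```python
from typing import Sequence


def intersection_area(b1: Sequence[float], b2: Sequence[float]) -> float:
    """Area of the overlap rectangle, or 0.0 when the boxes do not intersect."""
    width = min(b1[2], b2[2]) - max(b1[0], b2[0])
    height = min(b1[3], b2[3]) - max(b1[1], b2[1])
    return width * height if width > 0 and height > 0 else 0.0


def area(b: Sequence[float]) -> float:
    return (b[2] - b[0]) * (b[3] - b[1])


def overlap_over_smaller(b1: Sequence[float], b2: Sequence[float]) -> float:
    """Intersection divided by the smaller box area (the metric used above)."""
    smaller = min(area(b1), area(b2))
    return intersection_area(b1, b2) / smaller if smaller > 0 else 0.0


def bbox_iou(b1: Sequence[float], b2: Sequence[float]) -> float:
    """Standard intersection over union, shown for comparison."""
    inter = intersection_area(b1, b2)
    union = area(b1) + area(b2) - inter
    return inter / union if union > 0 else 0.0


outer, inner = [0, 0, 10, 10], [2, 2, 6, 6]
lenient = overlap_over_smaller(outer, inner)
strict = bbox_iou(outer, inner)
```

# A box fully contained in a larger one scores 1.0 with the smaller-area metric
# but only about 0.16 with IoU, so a tight segmentation bbox inside a loose
# detection bbox still clears the 0.3 validation threshold.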

587
main.py

@@ -188,8 +188,295 @@ def resolve_detect_segments(detect_segments, total_segments: int) -> List[int]:
logger.warning(f"Invalid detect_segments format: {detect_segments}. Using all segments.")
return list(range(total_segments))
def main():
"""Main processing pipeline."""
def process_segment_with_separate_eyes(segment_info, detector, sam2_processor, mask_processor, config,
previous_left_masks=None, previous_right_masks=None):
"""
Process a single segment using separate eye processing mode.
Split video first, then run YOLO independently on each eye.
Args:
segment_info: Segment information dictionary
detector: YOLO detector instance
sam2_processor: SAM2 processor with eye processing enabled
mask_processor: Mask processor instance
config: Configuration loader instance
previous_left_masks: Previous masks for left eye
previous_right_masks: Previous masks for right eye
Returns:
Tuple of (success, left_masks, right_masks)
"""
segment_idx = segment_info['index']
logger.info(f"VR180 Separate Eyes: Processing segment {segment_idx} (video-split approach)")
# Get video properties
cap = cv2.VideoCapture(segment_info['video_file'])
frame_width = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))
frame_height = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))
cap.release()
full_frame_shape = (frame_height, frame_width)
# Step 1: Split the segment video into left and right eye videos
left_eye_video = os.path.join(segment_info['directory'], "left_eye.mp4")
right_eye_video = os.path.join(segment_info['directory'], "right_eye.mp4")
logger.info("VR180 Separate Eyes: Splitting segment video into eye videos")
success = sam2_processor.eye_processor.split_video_into_eyes(
segment_info['video_file'],
left_eye_video,
right_eye_video,
scale=config.get_inference_scale()
)
if not success:
logger.error(f"VR180 Separate Eyes: Failed to split video for segment {segment_idx}")
return False, None, None
# Check if both eye videos were created
if not os.path.exists(left_eye_video) or not os.path.exists(right_eye_video):
logger.error(f"VR180 Separate Eyes: Eye video files not created for segment {segment_idx}")
return False, None, None
logger.info(f"VR180 Separate Eyes: Created eye videos - left: {left_eye_video}, right: {right_eye_video}")
# Step 2: Run YOLO independently on each eye video
left_detections = detector.detect_humans_in_video_first_frame(
left_eye_video, scale=1.0 # Already scaled during video splitting
)
right_detections = detector.detect_humans_in_video_first_frame(
right_eye_video, scale=1.0 # Already scaled during video splitting
)
logger.info(f"VR180 Separate Eyes: YOLO detections - left: {len(left_detections)}, right: {len(right_detections)}")
# Check if we have YOLO segmentation masks
has_yolo_masks = False
if detector.supports_segmentation:
has_yolo_masks = any(d.get('has_mask', False) for d in (left_detections + right_detections))
if has_yolo_masks:
logger.info("VR180 Separate Eyes: YOLO segmentation mode - using direct masks instead of bounding boxes")
# Save eye-specific debug frames if enabled
if config.get('advanced.save_yolo_debug_frames', False) and (left_detections or right_detections):
try:
# Load first frames from each eye video
left_cap = cv2.VideoCapture(left_eye_video)
ret_left, left_frame = left_cap.read()
left_cap.release()
right_cap = cv2.VideoCapture(right_eye_video)
ret_right, right_frame = right_cap.read()
right_cap.release()
if ret_left and ret_right:
# Save eye-specific debug frames
left_debug_path = os.path.join(segment_info['directory'], "left_eye_debug.jpg")
right_debug_path = os.path.join(segment_info['directory'], "right_eye_debug.jpg")
detector.save_eye_debug_frames(
left_frame, right_frame,
left_detections, right_detections,
left_debug_path, right_debug_path
)
logger.info(f"VR180 Separate Eyes: Saved eye-specific debug frames for segment {segment_idx}")
else:
logger.warning("VR180 Separate Eyes: Could not load eye frames for debug visualization")
except Exception as e:
logger.warning(f"VR180 Separate Eyes: Failed to create eye debug frames: {e}")
# Step 3: Process left eye if detections exist or we have previous masks
left_masks = None
if left_detections or previous_left_masks:
try:
left_prompts = None
left_initial_masks = None
if left_detections:
if has_yolo_masks:
# YOLO segmentation mode: convert masks to initial masks for SAM2
left_initial_masks = {}
for i, detection in enumerate(left_detections):
if detection.get('has_mask', False):
mask = detection['mask']
left_initial_masks[1] = mask.astype(bool) # Always use obj_id=1 for single eye
logger.info(f"VR180 Separate Eyes: Left eye YOLO mask - shape: {mask.shape}, pixels: {np.sum(mask)}")
break # Only take the first/best mask for single eye processing
if left_initial_masks:
logger.info(f"VR180 Separate Eyes: Left eye - using YOLO segmentation masks as initial masks")
else:
# YOLO detection mode: convert bounding boxes to prompts
left_prompts = detector.convert_detections_to_sam2_prompts(left_detections, frame_width // 2)
logger.info(f"VR180 Separate Eyes: Left eye - {len(left_prompts)} SAM2 prompts")
# Create temporary segment info for left eye processing
left_segment_info = segment_info.copy()
left_segment_info['video_file'] = left_eye_video
left_masks = sam2_processor.process_single_eye_segment(
left_segment_info, 'left', left_prompts,
left_initial_masks or previous_left_masks,
1.0 # Scale already applied during video splitting
)
if left_masks:
logger.info(f"VR180 Separate Eyes: Left eye processed - {len(left_masks)} frame masks")
else:
logger.warning(f"VR180 Separate Eyes: Left eye processing failed")
except Exception as e:
logger.error(f"VR180 Separate Eyes: Error processing left eye for segment {segment_idx}: {e}")
left_masks = None
# Step 4: Process right eye if detections exist or we have previous masks
right_masks = None
if right_detections or previous_right_masks:
try:
right_prompts = None
right_initial_masks = None
if right_detections:
if has_yolo_masks:
# YOLO segmentation mode: convert masks to initial masks for SAM2
right_initial_masks = {}
for i, detection in enumerate(right_detections):
if detection.get('has_mask', False):
mask = detection['mask']
right_initial_masks[1] = mask.astype(bool) # Always use obj_id=1 for single eye
logger.info(f"VR180 Separate Eyes: Right eye YOLO mask - shape: {mask.shape}, pixels: {np.sum(mask)}")
break # Only take the first/best mask for single eye processing
if right_initial_masks:
logger.info(f"VR180 Separate Eyes: Right eye - using YOLO segmentation masks as initial masks")
else:
# YOLO detection mode: convert bounding boxes to prompts
right_prompts = detector.convert_detections_to_sam2_prompts(right_detections, frame_width // 2)
logger.info(f"VR180 Separate Eyes: Right eye - {len(right_prompts)} SAM2 prompts")
# Create temporary segment info for right eye processing
right_segment_info = segment_info.copy()
right_segment_info['video_file'] = right_eye_video
right_masks = sam2_processor.process_single_eye_segment(
right_segment_info, 'right', right_prompts,
right_initial_masks or previous_right_masks,
1.0 # Scale already applied during video splitting
)
if right_masks:
logger.info(f"VR180 Separate Eyes: Right eye processed - {len(right_masks)} frame masks")
else:
logger.warning(f"VR180 Separate Eyes: Right eye processing failed")
except Exception as e:
logger.error(f"VR180 Separate Eyes: Error processing right eye for segment {segment_idx}: {e}")
right_masks = None
# Step 5: Check if we got any valid masks
if not left_masks and not right_masks:
logger.warning(f"VR180 Separate Eyes: Neither eye produced valid masks for segment {segment_idx}")
if config.get('processing.enable_greenscreen_fallback', True):
logger.info(f"VR180 Separate Eyes: Using greenscreen fallback for segment {segment_idx}")
success = mask_processor.process_greenscreen_only_segment(
segment_info,
green_color=config.get_green_color(),
use_nvenc=config.get_use_nvenc(),
bitrate=config.get_output_bitrate()
)
return success, None, None
else:
logger.error(f"VR180 Separate Eyes: No masks generated and greenscreen fallback disabled")
return False, None, None
# Step 6: Combine masks back to full frame format
try:
logger.info(f"VR180 Separate Eyes: Combining eye masks for segment {segment_idx}")
combined_masks = sam2_processor.eye_processor.combine_eye_masks(
left_masks, right_masks, full_frame_shape
)
if not combined_masks:
logger.error(f"VR180 Separate Eyes: Failed to combine eye masks for segment {segment_idx}")
return False, left_masks, right_masks
# Validate combined masks have reasonable content
total_mask_pixels = 0
for frame_idx, frame_masks in combined_masks.items():
for obj_id, mask in frame_masks.items():
if mask is not None:
total_mask_pixels += np.sum(mask)
if total_mask_pixels == 0:
logger.warning(f"VR180 Separate Eyes: Combined masks are empty for segment {segment_idx}")
if config.get('processing.enable_greenscreen_fallback', True):
logger.info(f"VR180 Separate Eyes: Using greenscreen fallback due to empty masks")
success = mask_processor.process_greenscreen_only_segment(
segment_info,
green_color=config.get_green_color(),
use_nvenc=config.get_use_nvenc(),
bitrate=config.get_output_bitrate()
)
return success, left_masks, right_masks
logger.info(f"VR180 Separate Eyes: Combined masks contain {total_mask_pixels} total pixels")
except Exception as e:
logger.error(f"VR180 Separate Eyes: Error combining eye masks for segment {segment_idx}: {e}")
# Try greenscreen fallback if mask combination fails
if config.get('processing.enable_greenscreen_fallback', True):
logger.info(f"VR180 Separate Eyes: Using greenscreen fallback due to mask combination error")
success = mask_processor.process_greenscreen_only_segment(
segment_info,
green_color=config.get_green_color(),
use_nvenc=config.get_use_nvenc(),
bitrate=config.get_output_bitrate()
)
return success, left_masks, right_masks
else:
return False, left_masks, right_masks
# Step 7: Save combined masks
mask_path = os.path.join(segment_info['directory'], "mask.png")
sam2_processor.save_final_masks(
combined_masks,
mask_path,
green_color=config.get_green_color(),
blue_color=config.get_blue_color()
)
# Step 8: Apply green screen and save output video
success = mask_processor.process_segment(
segment_info,
combined_masks,
use_nvenc=config.get_use_nvenc(),
bitrate=config.get_output_bitrate()
)
if success:
logger.info(f"VR180 Separate Eyes: Successfully processed segment {segment_idx}")
else:
logger.error(f"VR180 Separate Eyes: Failed to create output video for segment {segment_idx}")
# Clean up temporary eye video files
try:
if os.path.exists(left_eye_video):
os.remove(left_eye_video)
if os.path.exists(right_eye_video):
os.remove(right_eye_video)
logger.debug(f"VR180 Separate Eyes: Cleaned up temporary eye videos for segment {segment_idx}")
except Exception as e:
logger.warning(f"VR180 Separate Eyes: Failed to clean up temporary eye videos: {e}")
return success, left_masks, right_masks
async def main_async():
"""Main processing pipeline with async optimizations."""
args = parse_arguments()
try:
@@ -275,10 +562,42 @@ def main():
)
logger.info("Step 3: Initializing SAM2 processor")
# Check if separate eye processing is enabled
separate_eye_processing = config.get('processing.separate_eye_processing', False)
eye_overlap_pixels = config.get('processing.eye_overlap_pixels', 0)
enable_greenscreen_fallback = config.get('processing.enable_greenscreen_fallback', True)
# Initialize async preprocessor if enabled
async_preprocessor = None
if config.get('advanced.enable_background_lowres_generation', False):
from core.async_lowres_preprocessor import AsyncLowResPreprocessor
max_concurrent = config.get('advanced.max_concurrent_lowres', 3)
segments_ahead = config.get('advanced.lowres_segments_ahead', 3)
use_ffmpeg = config.get('advanced.use_ffmpeg_lowres', True)
async_preprocessor = AsyncLowResPreprocessor(
max_concurrent=max_concurrent,
segments_ahead=segments_ahead,
use_ffmpeg=use_ffmpeg
)
logger.info(f"Async low-res preprocessing: ENABLED (max_concurrent={max_concurrent}, segments_ahead={segments_ahead})")
else:
logger.info("Async low-res preprocessing: DISABLED")
if separate_eye_processing:
logger.info("VR180 Separate Eye Processing: ENABLED")
logger.info(f"Eye overlap pixels: {eye_overlap_pixels}")
logger.info(f"Greenscreen fallback: {enable_greenscreen_fallback}")
sam2_processor = SAM2Processor(
checkpoint_path=config.get_sam2_checkpoint(),
config_path=config.get_sam2_config(),
vos_optimized=config.get('models.sam2_vos_optimized', False)
vos_optimized=config.get('models.sam2_vos_optimized', False),
separate_eye_processing=separate_eye_processing,
eye_overlap_pixels=eye_overlap_pixels,
async_preprocessor=async_preprocessor
)
# Initialize mask processor with quality enhancements
@@ -293,11 +612,34 @@ def main():
logger.info("Step 4: Processing segments sequentially")
total_humans_detected = 0
# Start background low-res video preprocessing if enabled
if async_preprocessor:
logger.info("Starting background low-res video preprocessing")
async_preprocessor.start_background_preparation(
segments_info,
config.get_inference_scale(),
separate_eye_processing,
current_segment=0
)
# Initialize previous masks for separate eye processing
previous_left_masks = None
previous_right_masks = None
for i, segment_info in enumerate(segments_info):
segment_idx = segment_info['index']
logger.info(f"Processing segment {segment_idx}/{len(segments_info)-1}")
# Start background preparation for upcoming segments
if async_preprocessor and i < len(segments_info) - 1:
async_preprocessor.start_background_preparation(
segments_info,
config.get_inference_scale(),
separate_eye_processing,
current_segment=i
)
# Reset temporal history for new segment
mask_processor.reset_temporal_history()
@@ -307,6 +649,25 @@ def main():
logger.info(f"Segment {segment_idx} already processed, skipping")
continue
# Branch based on processing mode
if separate_eye_processing:
# Use separate eye processing mode
success, left_masks, right_masks = process_segment_with_separate_eyes(
segment_info, detector, sam2_processor, mask_processor, config,
previous_left_masks, previous_right_masks
)
# Update previous masks for next segment
previous_left_masks = left_masks
previous_right_masks = right_masks
if success:
logger.info(f"Successfully processed segment {segment_idx} with separate eye processing")
else:
logger.error(f"Failed to process segment {segment_idx} with separate eye processing")
continue # Skip the original processing logic
# Determine if we should use YOLO detections or previous masks
use_detections = segment_idx in detect_segments
@@ -320,138 +681,41 @@ def main():
previous_masks = None
if use_detections:
# Run YOLO detection on current segment
logger.info(f"Running YOLO detection on segment {segment_idx}")
detection_file = os.path.join(segment_info['directory'], "yolo_detections")
# Run YOLO stereo detection and matching on current segment
logger.info(f"Running stereo pair detection on segment {segment_idx}")
# Check if detection already exists
if os.path.exists(detection_file):
logger.info(f"Loading existing YOLO detections for segment {segment_idx}")
detections = detector.load_detections_from_file(detection_file)
else:
# Run YOLO detection on first frame
detections = detector.detect_humans_in_video_first_frame(
segment_info['video_file'],
scale=config.get_inference_scale()
)
# Save detections for future runs
detector.save_detections_to_file(detections, detection_file)
if detections:
total_humans_detected += len(detections)
logger.info(f"Found {len(detections)} humans in segment {segment_idx}")
# Get frame width from video
cap = cv2.VideoCapture(segment_info['video_file'])
frame_width = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))
cap.release()
yolo_prompts = detector.convert_detections_to_sam2_prompts(
detections, frame_width
)
# If no right eye detections found, run debug analysis with lower confidence
half_frame_width = frame_width // 2
right_eye_detections = [d for d in detections if (d['bbox'][0] + d['bbox'][2]) / 2 >= half_frame_width]
if len(right_eye_detections) == 0 and config.get('advanced.save_yolo_debug_frames', False):
logger.info(f"VR180 Debug: No right eye detections found, running lower confidence analysis...")
# Load first frame for debug analysis
cap = cv2.VideoCapture(segment_info['video_file'])
ret, debug_frame = cap.read()
cap.release()
if ret:
# Scale frame to match detection scale
if config.get_inference_scale() != 1.0:
scale = config.get_inference_scale()
debug_frame = cv2.resize(debug_frame, None, fx=scale, fy=scale, interpolation=cv2.INTER_LINEAR)
# Run debug detection with lower confidence
debug_detections = detector.debug_detect_with_lower_confidence(debug_frame, debug_confidence=0.3)
# Analyze where these lower confidence detections are
debug_right_eye = [d for d in debug_detections if (d['bbox'][0] + d['bbox'][2]) / 2 >= half_frame_width]
if len(debug_right_eye) > 0:
logger.warning(f"VR180 Debug: Found {len(debug_right_eye)} right eye detections with lower confidence!")
for i, det in enumerate(debug_right_eye):
logger.warning(f"VR180 Debug: Right eye detection {i+1}: conf={det['confidence']:.3f}, bbox={det['bbox']}")
logger.warning(f"VR180 Debug: Consider lowering yolo_confidence from {config.get_yolo_confidence()} to 0.3-0.4")
else:
logger.info(f"VR180 Debug: No right eye detections found even with confidence 0.3")
logger.info(f"VR180 Debug: This confirms person is not visible in right eye view")
logger.info(f"Pipeline Debug: Segment {segment_idx} - Generated {len(yolo_prompts)} SAM2 prompts from {len(detections)} YOLO detections")
# Save debug frame with detections visualized (if enabled)
if config.get('advanced.save_yolo_debug_frames', False):
debug_frame_path = os.path.join(segment_info['directory'], "yolo_debug.jpg")
# Load first frame for debug visualization
cap = cv2.VideoCapture(segment_info['video_file'])
ret, debug_frame = cap.read()
cap.release()
if ret:
# Scale frame to match detection scale
if config.get_inference_scale() != 1.0:
scale = config.get_inference_scale()
debug_frame = cv2.resize(debug_frame, None, fx=scale, fy=scale, interpolation=cv2.INTER_LINEAR)
detector.save_debug_frame_with_detections(debug_frame, detections, debug_frame_path, yolo_prompts)
# Load the first frame for detection
cap = cv2.VideoCapture(segment_info['video_file'])
ret, frame = cap.read()
cap.release()
if not ret:
logger.error(f"Could not read first frame of segment {segment_idx}")
continue
# Scale frame if needed
if config.get_inference_scale() != 1.0:
frame = cv2.resize(frame, None, fx=config.get_inference_scale(), fy=config.get_inference_scale(), interpolation=cv2.INTER_LINEAR)
yolo_prompts = detector.detect_and_match_stereo_pairs(
frame,
config.get_confidence_reduction_factor(),
config.get_stereo_iou_threshold(),
segment_info,
config.get('advanced.save_yolo_debug_frames', True)
)
if not yolo_prompts:
logger.warning(f"No valid stereo pairs found for segment {segment_idx}. Attempting to use previous segment's mask.")
if segment_idx > 0:
prev_segment_dir = segments_info[segment_idx - 1]['directory']
previous_masks = sam2_processor.load_previous_segment_mask(prev_segment_dir)
if previous_masks:
logger.info(f"Using masks from segment {segment_idx - 1} as fallback.")
else:
logger.warning(f"Could not load frame for debug visualization in segment {segment_idx}")
# Check if we have YOLO masks for debug visualization
has_yolo_masks = False
if detections and detector.supports_segmentation:
has_yolo_masks = any(d.get('has_mask', False) for d in detections)
# Generate first frame masks debug (SAM2 or YOLO)
first_frame_debug_path = os.path.join(segment_info['directory'], "first_frame_detection.jpg")
if has_yolo_masks:
logger.info(f"Pipeline Debug: Generating YOLO first frame masks for segment {segment_idx}")
# Create YOLO mask debug visualization
create_yolo_mask_debug_frame(detections, segment_info['video_file'], first_frame_debug_path, config.get_inference_scale())
else:
logger.info(f"Pipeline Debug: Generating SAM2 first frame masks for segment {segment_idx}")
sam2_processor.generate_first_frame_debug_masks(
segment_info['video_file'],
yolo_prompts,
first_frame_debug_path,
config.get_inference_scale()
)
else:
logger.warning(f"No humans detected in segment {segment_idx}")
# Save debug frame even when no detections (if enabled)
if config.get('advanced.save_yolo_debug_frames', False):
debug_frame_path = os.path.join(segment_info['directory'], "yolo_debug_no_detections.jpg")
# Load first frame for debug visualization
cap = cv2.VideoCapture(segment_info['video_file'])
ret, debug_frame = cap.read()
cap.release()
if ret:
# Scale frame to match detection scale
if config.get_inference_scale() != 1.0:
scale = config.get_inference_scale()
debug_frame = cv2.resize(debug_frame, None, fx=scale, fy=scale, interpolation=cv2.INTER_LINEAR)
# Add "No detections" text overlay
cv2.putText(debug_frame, "YOLO: No humans detected",
(10, 30),
cv2.FONT_HERSHEY_SIMPLEX, 1.0,
(0, 0, 255), 2) # Red text
cv2.imwrite(debug_frame_path, debug_frame)
logger.info(f"Saved no-detection debug frame to {debug_frame_path}")
else:
logger.warning(f"Could not load frame for no-detection debug visualization in segment {segment_idx}")
logger.error(f"Fallback failed: No previous mask found for segment {segment_idx}.")
else:
logger.error("Cannot use fallback for the first segment.")
elif segment_idx > 0:
# Try to load previous segment mask
for j in range(segment_idx - 1, -1, -1):
@@ -465,43 +729,20 @@ def main():
logger.error(f"No prompts or previous masks available for segment {segment_idx}")
continue
# Check if we have YOLO masks and can skip SAM2 (recheck in case detections were loaded from file)
if not 'has_yolo_masks' in locals():
has_yolo_masks = False
if detections and detector.supports_segmentation:
has_yolo_masks = any(d.get('has_mask', False) for d in detections)
# Check if we have YOLO masks from the stereo pair matching and can use them as initial masks for SAM2
if yolo_prompts and detector.supports_segmentation:
logger.info(f"Pipeline Debug: YOLO segmentation provided matched stereo masks - using as SAM2 initial masks.")
if has_yolo_masks:
logger.info(f"Pipeline Debug: YOLO segmentation provided masks - using as SAM2 initial masks for segment {segment_idx}")
# Convert the prompts (which contain masks) into the initial_masks format for SAM2
initial_masks = {prompt['obj_id']: prompt['mask'] for prompt in yolo_prompts if 'mask' in prompt}
# Convert YOLO masks to initial masks for SAM2
cap = cv2.VideoCapture(segment_info['video_file'])
frame_width = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))
frame_height = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))
cap.release()
# Convert YOLO masks to the format expected by SAM2 add_previous_masks_to_predictor
yolo_masks_dict = {}
for i, detection in enumerate(detections[:2]): # Up to 2 objects
if detection.get('has_mask', False):
mask = detection['mask']
# Resize mask to match inference scale
if config.get_inference_scale() != 1.0:
scale = config.get_inference_scale()
scaled_height = int(frame_height * scale)
scaled_width = int(frame_width * scale)
mask = cv2.resize(mask.astype(np.float32), (scaled_width, scaled_height), interpolation=cv2.INTER_NEAREST)
mask = mask > 0.5
obj_id = i + 1 # Sequential object IDs
yolo_masks_dict[obj_id] = mask.astype(bool)
logger.info(f"Pipeline Debug: YOLO mask for Object {obj_id} - shape: {mask.shape}, pixels: {np.sum(mask)}")
logger.info(f"Pipeline Debug: Using YOLO masks as SAM2 initial masks - {len(yolo_masks_dict)} objects")
# Use traditional SAM2 pipeline with YOLO masks as initial masks
previous_masks = yolo_masks_dict
yolo_prompts = None # Don't use bounding box prompts when we have masks
if initial_masks:
# We are providing initial masks, so we should not provide bbox prompts
previous_masks = initial_masks
yolo_prompts = None
logger.info(f"Pipeline Debug: Using {len(previous_masks)} YOLO masks as SAM2 initial masks.")
else:
logger.warning("YOLO segmentation mode is on, but no masks were found in the final prompts.")
# Debug what we're passing to SAM2
if yolo_prompts:
@@ -689,6 +930,16 @@ def main():
except Exception as e:
logger.error(f"Pipeline failed: {e}", exc_info=True)
return 1
finally:
# Cleanup async preprocessor if it was used
if async_preprocessor:
async_preprocessor.cleanup()
logger.debug("Async preprocessor cleanup completed")
def main():
"""Main entry point - wrapper for async main."""
import asyncio
return asyncio.run(main_async())
if __name__ == "__main__":
exit_code = main()

198
sbs_spec.md Normal file
View File

@@ -0,0 +1,198 @@
# Plan: Separate Left/Right Eye Processing for VR180 SAM2 Pipeline
## Overview
Implement a new processing mode that splits VR180 side-by-side frames into separate left and right halves, processes each eye independently through SAM2, then recombines them into the final output. This should improve tracking accuracy by removing parallax confusion between eyes.
## Key Changes Required
### 1. Configuration Updates
**File: `config.yaml`**
- Add new configuration option: `processing.separate_eye_processing: false` (default off for backward compatibility)
- Add related options:
- `processing.enable_greenscreen_fallback: true` (render full green if no humans detected)
- `processing.eye_overlap_pixels: 0` (optional overlap for blending)
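The pipeline reads these options through dotted keys such as `config.get('processing.separate_eye_processing', False)`. A minimal sketch of how such a lookup could resolve against the nested YAML structure is shown here; the `get_nested` helper is hypothetical and is not taken from the repository's config loader.

```python
from typing import Any


def get_nested(settings: dict, dotted_key: str, default: Any = None) -> Any:
    """Hypothetical helper: resolve a key like 'a.b.c' against nested dicts."""
    node: Any = settings
    for part in dotted_key.split("."):
        if not isinstance(node, dict) or part not in node:
            return default
        node = node[part]
    return node


# The defaults proposed in this plan, as the parsed YAML would look.
settings = {
    "processing": {
        "separate_eye_processing": False,
        "enable_greenscreen_fallback": True,
        "eye_overlap_pixels": 0,
    }
}

separate_eyes = get_nested(settings, "processing.separate_eye_processing", False)
overlap = get_nested(settings, "processing.eye_overlap_pixels", 0)
# Unknown keys fall back to the supplied default, so old configs keep working.
missing = get_nested(settings, "processing.not_a_real_option", "fallback")
```

Returning the default for unknown keys is what keeps existing configuration files valid when the new options are absent.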
### 2. Core SAM2 Processor Enhancements
**File: `core/sam2_processor.py`**
#### New Methods:
- `split_frame_into_eyes(frame) -> (left_frame, right_frame)`
- `split_video_into_eyes(video_path, left_output, right_output, scale)`
- `process_single_eye_segment(segment_info, eye_side, yolo_prompts, previous_masks, inference_scale)`
- `combine_eye_masks(left_masks, right_masks, full_frame_shape) -> combined_masks`
- `create_greenscreen_segment(segment_info, duration_seconds) -> bool`
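As a rough illustration of the splitting and recombining methods listed above, the sketch below works on NumPy arrays. It assumes each eye's result is laid out as `{frame_idx: {1: half_width_bool_mask}}` and that the combined output uses object 1 for the left eye and object 2 for the right eye; both layouts are assumptions made for this example, not a description of the real implementation.

```python
import numpy as np


def split_frame_into_eyes(frame):
    """Split a side-by-side frame into (left, right) halves along the width."""
    half = frame.shape[1] // 2
    return frame[:, :half], frame[:, half:]


def combine_eye_masks(left_masks, right_masks, full_frame_shape):
    """Place per-eye masks back into full-frame boolean masks.

    Assumed input layout per eye: {frame_idx: {1: bool mask of that eye}}.
    Assumed output layout: {frame_idx: {1: left mask, 2: right mask}}.
    """
    height, width = full_frame_shape[:2]
    half = width // 2
    # (output obj_id, per-eye masks, first column, one past the last column)
    layout = ((1, left_masks, 0, half), (2, right_masks, half, width))
    frame_indices = set(left_masks or {}) | set(right_masks or {})
    combined = {}
    for frame_idx in sorted(frame_indices):
        frame_masks = {}
        for obj_id, eye_masks, x_start, x_end in layout:
            eye_mask = (eye_masks or {}).get(frame_idx, {}).get(1)
            if eye_mask is None:
                continue  # this eye produced nothing for this frame
            full = np.zeros((height, width), dtype=bool)
            full[:, x_start:x_end] = eye_mask
            frame_masks[obj_id] = full
        combined[frame_idx] = frame_masks
    return combined


frame = np.zeros((4, 6, 3), dtype=np.uint8)
left_frame, right_frame = split_frame_into_eyes(frame)
combined = combine_eye_masks({0: {1: np.ones((4, 3), dtype=bool)}}, None, (4, 6))
```

A missing eye (`None`) simply contributes no object for that frame, which matches the plan's idea of processing each eye only when it has something to track.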
#### Modified Methods:
- `process_single_segment()` - Add branch for separate eye processing mode
- New processing flow:
1. Check if separate_eye_processing enabled
2. If enabled: split segment video into left/right eye videos
3. Process each eye independently with SAM2
4. Combine masks back to full frame format
5. If fallback needed: create full greenscreen segment
### 3. YOLO Detector Enhancements
**File: `core/yolo_detector.py`**
#### New Methods:
- `detect_humans_in_single_eye(frame, eye_side) -> List[Dict]`
- `convert_eye_detections_to_sam2_prompts(detections, eye_side) -> List[Dict]`
- `has_any_detections(detections_list) -> bool`
#### Modified Methods:
- `detect_humans_in_video_first_frame()` - Add eye-specific detection support
- Object ID assignment: Always use obj_id=1 for single-eye processing (since each eye is processed independently)
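The object ID rule above can be sketched as follows. The example assumes detections are dicts with a `bbox` of `[x1, y1, x2, y2]` in full-frame coordinates and a `confidence` score, and that only the most confident detection per eye is kept; the function name, its `frame_width` parameter, and the prompt layout are all hypothetical.

```python
def eye_detections_to_prompts(detections, eye_side, frame_width):
    """Hypothetical sketch: turn one eye's detections into a single prompt.

    Boxes are shifted into eye-local coordinates (the right eye starts at
    frame_width // 2) and the prompt always carries obj_id=1, because each
    eye is tracked on its own.
    """
    x_offset = frame_width // 2 if eye_side == "right" else 0
    ranked = sorted(detections, key=lambda d: d["confidence"], reverse=True)
    prompts = []
    for det in ranked[:1]:  # keep only the best detection for this eye
        x1, y1, x2, y2 = det["bbox"]
        prompts.append({
            "obj_id": 1,
            "bbox": [x1 - x_offset, y1, x2 - x_offset, y2],
            "confidence": det["confidence"],
        })
    return prompts


right_detections = [
    {"bbox": [1100, 50, 1300, 400], "confidence": 0.6},
    {"bbox": [1500, 10, 1700, 300], "confidence": 0.9},
]
right_prompts = eye_detections_to_prompts(right_detections, "right", 2000)
```

If the detector is instead run directly on the already-split eye video, the boxes are eye-local from the start and the offset step is unnecessary.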
### 4. Mask Processor Updates
**File: `core/mask_processor.py`**
#### New Methods:
- `create_full_greenscreen_frame(frame_shape) -> np.ndarray`
- `process_greenscreen_only_segment(segment_info, frame_count) -> bool`
#### Modified Methods:
- `apply_green_mask()` - Handle combined eye masks properly
- Add support for full-greenscreen fallback when no humans detected
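A minimal sketch of the greenscreen helpers, assuming frames are `uint8` arrays of shape `(height, width, 3)` and person masks are boolean arrays of shape `(height, width)`. `apply_green_background` is a hypothetical name used only for this illustration; the default color `(0, 255, 0)` is pure green in both RGB and BGR order.

```python
import numpy as np


def create_full_greenscreen_frame(frame_shape, green_color=(0, 255, 0)):
    """Return a frame of the given (height, width, ...) filled with green."""
    height, width = frame_shape[:2]
    frame = np.empty((height, width, 3), dtype=np.uint8)
    frame[:] = green_color
    return frame


def apply_green_background(frame, person_mask, green_color=(0, 255, 0)):
    """Keep pixels where person_mask is True and paint everything else green."""
    background = create_full_greenscreen_frame(frame.shape, green_color)
    return np.where(person_mask[..., None], frame, background)


frame = np.full((2, 4, 3), 7, dtype=np.uint8)
mask = np.zeros((2, 4), dtype=bool)
mask[0, 0] = True
composited = apply_green_background(frame, mask)
```

With an all-False mask the result equals the full greenscreen frame, which is the fallback case described in this section.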
### 5. Main Pipeline Integration
**File: `main.py`**
#### Processing Flow Changes:
```python
# For each segment:
for segment_info in segments_info:
    if config.get('processing.separate_eye_processing', False):
        # 1. Run YOLO on full frame to check for ANY human presence
        full_frame_detections = detector.detect_humans_in_video_first_frame(segment_video)
        if not full_frame_detections:
            # No humans detected anywhere - create full greenscreen segment
            success = mask_processor.process_greenscreen_only_segment(segment_info, expected_frame_count)
            continue
        # 2. Split detections by eye and process separately
        left_detections = [d for d in full_frame_detections if is_in_left_half(d, frame_width)]
        right_detections = [d for d in full_frame_detections if is_in_right_half(d, frame_width)]
        # 3. Process left eye (if detections exist)
        left_masks = None
        if left_detections:
            left_eye_prompts = detector.convert_eye_detections_to_sam2_prompts(left_detections, 'left')
            left_masks = sam2_processor.process_single_eye_segment(segment_info, 'left', left_eye_prompts, previous_left_masks, inference_scale)
        # 4. Process right eye (if detections exist)
        right_masks = None
        if right_detections:
            right_eye_prompts = detector.convert_eye_detections_to_sam2_prompts(right_detections, 'right')
            right_masks = sam2_processor.process_single_eye_segment(segment_info, 'right', right_eye_prompts, previous_right_masks, inference_scale)
        # 5. Combine masks back to full frame format
        if left_masks or right_masks:
            combined_masks = sam2_processor.combine_eye_masks(left_masks, right_masks, full_frame_shape)
            # Continue with normal mask processing...
        else:
            # Neither eye had trackable humans - full greenscreen fallback
            success = mask_processor.process_greenscreen_only_segment(segment_info, expected_frame_count)
    else:
        # Original processing mode (current behavior)
        # ... existing logic unchanged
        pass
```
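The flow above relies on `is_in_left_half` and `is_in_right_half`, which the plan does not define. One possible sketch assigns a detection to an eye by the horizontal center of its bounding box, mirroring the `(d['bbox'][0] + d['bbox'][2]) / 2 >= half_frame_width` test already used in `main.py`. The `[x1, y1, x2, y2]` box layout is an assumption inferred from that expression.

```python
def bbox_center_x(detection):
    """Horizontal center of a detection, assuming bbox = [x1, y1, x2, y2]."""
    x1, _, x2, _ = detection["bbox"]
    return (x1 + x2) / 2


def is_in_left_half(detection, frame_width):
    """True when the box center falls in the left-eye half of the frame."""
    return bbox_center_x(detection) < frame_width // 2


def is_in_right_half(detection, frame_width):
    """True when the box center falls in the right-eye half of the frame."""
    return bbox_center_x(detection) >= frame_width // 2


detections = [
    {"bbox": [100, 0, 300, 10]},    # center x = 200
    {"bbox": [900, 0, 1100, 10]},   # center x = 1000, exactly on the seam
    {"bbox": [1500, 0, 1700, 10]},  # center x = 1600
]
left = [d for d in detections if is_in_left_half(d, 2000)]
right = [d for d in detections if is_in_right_half(d, 2000)]
```

Because the two predicates are exact complements, every detection lands in exactly one eye, including boxes centered on the seam.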
### 6. File Structure Changes
#### New Files:
- `core/eye_processor.py` - Dedicated class for eye-specific operations
- `utils/video_utils.py` - Video manipulation utilities (splitting, combining)
#### Modified Files:
- All core processing modules as detailed above
- Update logging to distinguish left/right eye processing
- Update debug frame generation for eye-specific visualization
### 7. Debug and Monitoring Enhancements
#### Debug Outputs:
- `left_eye_debug.jpg` - Left eye YOLO detections
- `right_eye_debug.jpg` - Right eye YOLO detections
- `left_eye_sam2_masks.jpg` - Left eye SAM2 results
- `right_eye_sam2_masks.jpg` - Right eye SAM2 results
- `combined_masks_debug.jpg` - Final combined result
#### Logging Enhancements:
- Clear distinction between left/right eye processing stages
- Performance metrics for each eye processing
- Fallback trigger logging when no humans detected
### 8. Performance Considerations
#### Optimizations:
- **Parallel Processing**: Process left and right eyes simultaneously using threading
- **Selective Processing**: Skip SAM2 for eyes with no YOLO detections
- **Memory Management**: Clean up intermediate eye videos promptly
- **Caching**: Cache split eye videos if processing multiple segments
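The parallel and selective processing ideas above could look roughly like this with the standard library's `ThreadPoolExecutor`. `process_eye` stands in for whatever callable runs SAM2 on one eye and is purely hypothetical, as is the job format. Whether a shared SAM2 predictor can safely be used from two threads, and whether a single GPU gains anything from it, are open questions this sketch does not settle.

```python
import logging
from concurrent.futures import ThreadPoolExecutor

logger = logging.getLogger(__name__)


def process_eyes_in_parallel(process_eye, left_job, right_job):
    """Run process_eye on both eyes at once and return (left, right) results.

    A job of None means "no detections for this eye" and is skipped.
    An exception in one eye is logged and yields None for that eye only.
    """
    def run(job):
        if job is None:
            return None
        try:
            return process_eye(job)
        except Exception:
            logger.exception("Eye processing failed")
            return None

    with ThreadPoolExecutor(max_workers=2) as pool:
        left_future = pool.submit(run, left_job)
        right_future = pool.submit(run, right_job)
        return left_future.result(), right_future.result()


# Stand-in workload: pretend each job is a frame count and "masks" is a list.
left_result, right_result = process_eyes_in_parallel(
    lambda frame_count: list(range(frame_count)), 3, None
)
```

Isolating failures per eye keeps the later fallback logic simple: each side is either a result or `None`.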
#### Resource Usage:
- **Memory**: ~2x peak usage during eye processing (temporary)
- **Storage**: Temporary left/right eye videos (~1.5x original size)
- **Compute**: Potentially faster overall due to smaller frame processing
### 9. Backward Compatibility
#### Default Behavior:
- `separate_eye_processing: false` by default
- Existing configurations work unchanged
- All current functionality preserved
#### Migration Path:
- Users can gradually test new mode on problematic segments
- Configuration flag allows easy A/B testing
- Existing debug outputs remain functional
### 10. Error Handling and Fallbacks
#### Robust Error Recovery:
- If eye splitting fails → fall back to original processing
- If single eye SAM2 fails → use greenscreen for that eye
- If both eyes fail → full greenscreen segment
- Comprehensive logging of all fallback triggers
#### Quality Validation:
- Verify combined masks have reasonable pixel counts
- Check for mask alignment issues between eyes
- Validate segment completeness before marking done
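One way to read the recovery and validation rules above is as a small decision function. The sketch below assumes combined masks are laid out as `{frame_idx: {obj_id: mask_or_None}}`, the same shape the pixel-count loop in `main.py` iterates over. The `choose_action` name and its return values are hypothetical, and treating "fallback disabled" as a failure is a simplification made for this example.

```python
import numpy as np


def count_mask_pixels(combined_masks):
    """Total number of True pixels across all frames and objects."""
    total = 0
    for frame_masks in combined_masks.values():
        for mask in frame_masks.values():
            if mask is not None:
                total += int(np.sum(mask))
    return total


def choose_action(left_masks, right_masks, combined_masks, fallback_enabled=True):
    """Return 'process', 'greenscreen' or 'fail' for one segment."""
    has_eye_masks = bool(left_masks) or bool(right_masks)
    if has_eye_masks and combined_masks and count_mask_pixels(combined_masks) > 0:
        return "process"
    # Both eyes failed, or the combined masks are empty.
    return "greenscreen" if fallback_enabled else "fail"


good = {0: {1: np.ones((2, 2), dtype=bool), 2: None}}
empty = {0: {1: np.zeros((2, 2), dtype=bool)}}
action_good = choose_action({0: {1: good[0][1]}}, None, good)
action_empty = choose_action({0: {1: empty[0][1]}}, None, empty)
```

Checking the pixel count after combining catches the case where tracking technically succeeded but produced blank masks for every frame.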
## Implementation Priority
### Phase 1 (Core Functionality)
1. Configuration schema updates
2. Basic eye splitting and recombining logic
3. Modified SAM2 processor with separate eye support
4. Greenscreen fallback implementation
### Phase 2 (Integration)
1. Main pipeline integration with new processing mode
2. YOLO detector eye-specific enhancements
3. Mask processor updates for combined masks
4. Basic error handling and fallbacks
### Phase 3 (Polish)
1. Performance optimizations (parallel processing)
2. Enhanced debug outputs and logging
3. Comprehensive testing and validation
4. Documentation updates
## Expected Benefits
### Tracking Improvements:
- **Eliminated Parallax Confusion**: SAM2 processes single viewpoint per eye
- **Better Object Consistency**: Single object tracking per eye view
- **Improved Temporal Coherence**: Less cross-eye interference
- **Reduced False Positives**: Eye-specific context for tracking
### Operational Benefits:
- **Graceful Degradation**: Full greenscreen when humans not detected
- **Flexible Processing**: Can enable/disable per pipeline
- **Better Debug Visibility**: Eye-specific debug outputs
- **Performance Scalability**: Smaller frames = faster processing per eye
This plan maintains full backward compatibility while adding the requested separate eye processing capability with robust fallback mechanisms.

View File

@@ -0,0 +1,122 @@
# YOLO + SAM2 Video Processing Configuration with VR180 Separate Eye Processing
input:
video_path: "./input/regrets_full.mp4"
output:
directory: "./output/"
filename: "vr180_processed_both_eyes.mp4"
processing:
# Duration of each video segment in seconds
segment_duration: 5
# Scale factor for SAM2 inference (0.5 = half resolution)
inference_scale: 0.4
# YOLO detection confidence threshold (lowered for better VR180 detection)
yolo_confidence: 0.4
# Which segments to run YOLO detection on
detect_segments: "all"
# VR180 separate eye processing mode (currently disabled; set to true to enable)
separate_eye_processing: false
# IoU threshold for matching left-eye and right-eye detections as a stereo pair.
# A value of 0.5 means masks must overlap by 50% to be considered a pair.
stereo_iou_threshold: 0.5
# Factor to reduce YOLO confidence by if no stereo pairs are found on the first try (e.g., 0.8 = 20% reduction).
confidence_reduction_factor: 0.8
# If no humans are detected in a segment, create a full green screen video.
# Only used when separate_eye_processing is true.
enable_greenscreen_fallback: true
# Pixel overlap between left/right eyes for blending (0 = no overlap)
eye_overlap_pixels: 0
models:
# YOLO detection mode: "detection" (bounding boxes) or "segmentation" (direct masks)
yolo_mode: "segmentation" # Default: existing behavior, Options: "detection", "segmentation"
# YOLO model paths for different modes
yolo_detection_model: "models/yolo/yolo11l.pt" # Regular YOLO for detection mode
yolo_segmentation_model: "models/yolo/yolo11x-seg.pt" # Segmentation YOLO for segmentation mode
# SAM2 model configuration
sam2_checkpoint: "models/sam2/checkpoints/sam2.1_hiera_small.pt"
sam2_config: "models/sam2/configs/sam2.1/sam2.1_hiera_s.yaml"
video:
# Use NVIDIA hardware encoding (requires NVENC-capable GPU)
use_nvenc: true
# Output video bitrate
output_bitrate: "25M"
# Preserve original audio track
preserve_audio: true
# Force keyframes for better segment boundaries
force_keyframes: true
advanced:
# Green screen color (RGB values)
green_color: [0, 255, 0]
# Blue screen color for second object (note: [255, 0, 0] is blue only in OpenCV's BGR order; read as RGB it is red)
blue_color: [255, 0, 0]
# YOLO human class ID (0 for COCO person class)
human_class_id: 0
# GPU memory management
cleanup_intermediate_files: true
# Logging level (DEBUG, INFO, WARNING, ERROR)
log_level: "INFO"
# Save debug frames with YOLO detections visualized (ENABLED FOR TESTING)
save_yolo_debug_frames: true
# --- Mid-Segment Re-detection ---
# Re-run YOLO at intervals within a segment to correct tracking drift.
enable_mid_segment_detection: false
redetection_interval: 30 # Frames between re-detections.
max_redetections_per_segment: 10
# Parallel Processing Optimizations
enable_background_lowres_generation: false # Enable async low-res video pre-generation (temporarily disabled until a pending syntax fix lands)
max_concurrent_lowres: 2 # Max parallel FFmpeg processes for low-res creation
lowres_segments_ahead: 2 # How many segments to prepare in advance
use_ffmpeg_lowres: true # Use FFmpeg instead of OpenCV for low-res creation
# Mask Quality Enhancement Settings - Optimized for Performance
mask_processing:
# Edge feathering and blurring (REDUCED for performance)
enable_edge_blur: true # Enable Gaussian blur on mask edges for smooth transitions
edge_blur_radius: 3 # Reduced from 10 to 3 for better performance
edge_blur_sigma: 0.5 # Gaussian blur standard deviation
# Temporal smoothing between frames
enable_temporal_smoothing: false # Enable frame-to-frame mask blending
temporal_blend_weight: 0.2 # Weight for previous frame (0.0-1.0, higher = more smoothing)
temporal_history_frames: 2 # Number of previous frames to consider
# Morphological mask cleaning (DISABLED for VR180 - SAM2 masks are already high quality)
enable_morphological_cleaning: false # Disabled for performance - SAM2 produces clean masks
morphology_kernel_size: 5 # Kernel size for opening/closing operations
min_component_size: 500 # Minimum pixel area for connected components
# Alpha blending mode (OPTIMIZED)
alpha_blending_mode: "linear" # Linear is fastest - keep as-is
alpha_transition_width: 1 # Width of transition zone in pixels
# Advanced options
enable_bilateral_filter: false # Edge-preserving smoothing (slower but higher quality)
bilateral_d: 9 # Bilateral filter diameter
bilateral_sigma_color: 75 # Bilateral filter color sigma
bilateral_sigma_space: 75 # Bilateral filter space sigma