stereo mask working
This commit is contained in:
claude.md (new file, 230 lines)
@@ -0,0 +1,230 @@
# YOLO + SAM2 VR180 Video Processing Pipeline - LLM Guide

## Project Overview

This repository implements an automated video processing pipeline specifically designed for **VR180 side-by-side stereo videos**. The system detects and segments humans in video content, replacing backgrounds with green screen for post-production compositing. The pipeline is optimized for long VR videos by splitting them into manageable segments, processing each segment independently, and then reassembling the final output.

## Core Purpose

The primary goal is to automatically create green screen videos from VR180 content where:

- **Left eye view** (left half of frame) contains humans as Object 1 (green masks)
- **Right eye view** (right half of frame) contains humans as Object 2 (blue masks)
- Background is replaced with pure green (RGB: 0,255,0) for chroma keying
- Original audio is preserved throughout the process
- Processing handles videos of any length through segmentation

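Conceptually, the compositing step is simple once per-object masks exist. A minimal sketch, assuming boolean HxW masks and BGR frames from OpenCV (the real implementation lives in `core/mask_processor.py` and runs on the GPU via CuPy):

```python
import numpy as np

def composite_green_screen(frame: np.ndarray,
                           left_mask: np.ndarray,
                           right_mask: np.ndarray) -> np.ndarray:
    """Keep human pixels from both eye masks and paint everything else green.

    frame:      HxWx3 uint8 BGR frame
    left_mask:  HxW boolean mask for Object 1 (left eye human)
    right_mask: HxW boolean mask for Object 2 (right eye human)
    """
    out = np.empty_like(frame)
    out[:] = (0, 255, 0)              # pure green background (same triplet in RGB and BGR)
    keep = left_mask | right_mask     # union of both tracked objects
    out[keep] = frame[keep]           # copy original pixels where humans were segmented
    return out
```
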
## Architecture Overview

### Pipeline Stages

1. **Video Segmentation** (`core/video_splitter.py`)
   - Splits long videos into 5-second segments using FFmpeg
   - Creates organized directory structure: `segment_0/`, `segment_1/`, etc.
   - Preserves timestamps and forces keyframes for clean cuts

2. **Human Detection** (`core/yolo_detector.py`)
   - Uses YOLOv8 for robust human detection in VR180 format
   - Supports both detection mode (bounding boxes) and segmentation mode (direct masks)
   - Automatically assigns humans to left/right eye based on position in frame
   - Saves detection results for reuse and debugging

3. **Mask Generation** (`core/sam2_processor.py`)
   - Uses Meta's SAM2 (Segment Anything Model 2) for precise segmentation
   - Propagates masks across all frames in each segment
   - Supports mask continuity between segments using the previous segment's final masks
   - Handles VR180 stereo tracking with separate object IDs for each eye

4. **Green Screen Processing** (`core/mask_processor.py`)
   - Applies generated masks to isolate humans
   - Replaces background with green screen
   - Uses GPU acceleration (CuPy) for fast processing
   - Maintains original video quality and framerate

5. **Video Assembly** (`core/video_assembler.py`)
   - Concatenates all processed segments into the final video
   - Preserves the original audio track from the input video
   - Uses hardware encoding (NVENC) when available

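For context, the splitting stage boils down to a single FFmpeg call per input video. A hedged sketch of the idea (flags are illustrative; the exact command is built in `core/video_splitter.py`, which additionally moves each piece into its own `segment_N/` directory):

```python
import subprocess

def split_video(video_path: str, out_pattern: str, seconds: int = 5) -> None:
    """Split a video into fixed-length pieces, forcing a keyframe at every cut point."""
    subprocess.run([
        "ffmpeg", "-i", video_path,
        "-force_key_frames", f"expr:gte(t,n_forced*{seconds})",  # clean cuts at segment boundaries
        "-f", "segment", "-segment_time", str(seconds),
        "-reset_timestamps", "1",
        "-c:v", "libx264", "-c:a", "copy",
        out_pattern,                  # e.g. "output/segment_%d.mp4"
    ], check=True)
```
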
### Key Components

```
samyolo_on_segments/
├── main.py                 # Entry point - orchestrates the pipeline
├── config.yaml             # Configuration file (YAML format)
├── core/                   # Core processing modules
│   ├── config_loader.py    # Configuration management
│   ├── video_splitter.py   # FFmpeg-based video segmentation
│   ├── yolo_detector.py    # YOLO human detection (detection/segmentation modes)
│   ├── sam2_processor.py   # SAM2 mask generation and propagation
│   ├── mask_processor.py   # Green screen application
│   └── video_assembler.py  # Final video concatenation
├── utils/                  # Utility functions
│   ├── file_utils.py       # File system operations
│   ├── logging_utils.py    # Logging configuration
│   └── status_utils.py     # Progress monitoring
└── models/                 # Model storage (created by download_models.py)
    ├── sam2/               # SAM2 checkpoints and configs
    └── yolo/               # YOLO model weights
```

## VR180 Specific Features

### Stereo Video Handling
- Automatically detects humans in left and right eye views
- Assigns Object ID 1 to left eye humans (green masks)
- Assigns Object ID 2 to right eye humans (blue masks)
- Maintains stereo correspondence throughout segments

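Stereo correspondence is established with a similarity score rather than mask IoU, because the two eye views barely overlap in pixel space. A simplified sketch of the heuristic used by `_calculate_stereo_similarity` in `core/yolo_detector.py` (the full version also resizes mismatched masks and logs every comparison):

```python
import numpy as np

def stereo_similarity(left_mask, right_mask, left_bbox, right_bbox) -> float:
    """Score how likely two per-eye masks depict the same person (0..1).

    Masks are boolean HxW arrays; bboxes are [x1, y1, x2, y2] in eye coordinates.
    """
    height = left_mask.shape[0]

    # 1. Size similarity: ratio of mask areas.
    la, ra = left_mask.sum(), right_mask.sum()
    if la == 0 or ra == 0:
        return 0.0
    area_ratio = min(la, ra) / max(la, ra)

    # 2. Vertical alignment: the same person sits at a similar height in both eyes.
    ly = (left_bbox[1] + left_bbox[3]) / 2
    ry = (right_bbox[1] + right_bbox[3]) / 2
    y_sim = max(0.0, 1.0 - 2.0 * abs(ly - ry) / height)

    # 3. Bounding-box height similarity.
    lh, rh = left_bbox[3] - left_bbox[1], right_bbox[3] - right_bbox[1]
    height_ratio = min(lh, rh) / max(lh, rh) if min(lh, rh) > 0 else 0.0

    # 4. Aspect-ratio similarity.
    lw, rw = left_bbox[2] - left_bbox[0], right_bbox[2] - right_bbox[0]
    if min(lw, rw) <= 0 or min(lh, rh) <= 0:
        aspect_sim = 0.0
    else:
        l_asp, r_asp = lw / lh, rw / rh
        aspect_sim = max(0.0, 1.0 - abs(l_asp - r_asp) / max(l_asp, r_asp))

    # Weighted combination used by the matcher.
    return 0.3 * area_ratio + 0.4 * y_sim + 0.2 * height_ratio + 0.1 * aspect_sim
```
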
### Frame Division Logic
- Frame width is divided in half to separate left/right views
- Human detection centers are used to determine eye assignment
- If only one human is detected, it may be duplicated to both eyes (configurable)

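The eye-assignment rule itself is a one-liner. A minimal sketch (hypothetical helper name; bbox given as `[x1, y1, x2, y2]` in full-frame coordinates):

```python
def assign_eye(bbox, frame_width: int) -> int:
    """Return the SAM2 object ID for a detection based on which eye half it falls in."""
    center_x = (bbox[0] + bbox[2]) / 2
    return 1 if center_x < frame_width / 2 else 2  # 1 = left eye (green), 2 = right eye (blue)
```
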
## Configuration System

The pipeline is controlled via `config.yaml` with these key sections:

### Essential Settings
```yaml
input:
  video_path: "/path/to/vr180_video.mp4"

output:
  directory: "/path/to/output/"
  filename: "greenscreen_output.mp4"

processing:
  segment_duration: 5      # Seconds per segment
  inference_scale: 0.5     # Scale for faster processing
  yolo_confidence: 0.6     # Detection threshold
  detect_segments: "all"   # Which segments to process

models:
  yolo_model: "models/yolo/yolov8n.pt"
  sam2_checkpoint: "models/sam2/checkpoints/sam2.1_hiera_large.pt"
  sam2_config: "models/sam2/configs/sam2.1/sam2.1_hiera_l.yaml"
```

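Reading this file requires nothing beyond PyYAML. A minimal sketch of what the `ConfigLoader` in `core/config_loader.py` does conceptually (the accessor style shown is illustrative, not the exact API):

```python
import yaml

with open("config.yaml", "r") as f:
    config = yaml.safe_load(f)

# Nested lookups with sensible defaults, mirroring ConfigLoader-style getters.
inference_scale = config.get("processing", {}).get("inference_scale", 1.0)
video_path = config["input"]["video_path"]
print(f"Processing {video_path} at scale {inference_scale}")
```
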
### Advanced Options
- **YOLO Modes**: Switch between detection (bounding boxes) and segmentation (direct masks)
- **Mid-segment Detection**: Re-detect humans at intervals within segments
- **Mask Quality**: Temporal smoothing, morphological operations, edge refinement
- **Debug Outputs**: Save detection visualizations and first-frame masks

## Processing Flow

### For First Segment (segment_0):
1. Load first frame at inference scale
2. Run YOLO to detect humans
3. Convert detections to SAM2 prompts (or use YOLO masks directly)
4. Initialize SAM2 with prompts/masks
5. Propagate masks through all frames
6. Apply green screen and save output
7. Save final mask for next segment

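Steps 4-5 map onto SAM2's video predictor API (the same `add_new_points_or_box` / propagation calls used in `core/sam2_processor.py`). A minimal propagation sketch, assuming SAM2's `propagate_in_video` generator and an already initialized predictor and inference state:

```python
def collect_segment_masks(predictor, inference_state) -> dict:
    """Propagate the prompted objects through the segment and collect boolean masks."""
    masks_by_frame = {}
    for frame_idx, obj_ids, mask_logits in predictor.propagate_in_video(inference_state):
        masks_by_frame[frame_idx] = {
            obj_id: (mask_logits[i] > 0.0).cpu().numpy()   # logits above 0 count as foreground
            for i, obj_id in enumerate(obj_ids)
        }
    return masks_by_frame
```
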
### For Subsequent Segments:
1. Check if YOLO detection is requested for this segment
2. If yes: Use YOLO detection (same as first segment)
3. If no: Load previous segment's final mask
4. Initialize SAM2 with previous masks
5. Continue propagation through segment
6. Apply green screen and save output

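Re-seeding from the prior segment corresponds to SAM2's `add_new_mask` call, which `add_previous_masks_to_predictor` in `core/sam2_processor.py` wraps. A minimal sketch, assuming `previous_masks` maps object IDs to boolean arrays:

```python
import numpy as np

def seed_from_previous_masks(predictor, inference_state, previous_masks: dict) -> None:
    """Initialize SAM2 tracking from the prior segment's final masks instead of YOLO prompts."""
    for obj_id, mask in previous_masks.items():      # e.g. {1: left_eye_mask, 2: right_eye_mask}
        predictor.add_new_mask(
            inference_state=inference_state,
            frame_idx=0,                             # seed on the segment's first frame
            obj_id=obj_id,
            mask=np.asarray(mask, dtype=bool),
        )
```
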
### Fallback Logic:
- If no previous mask exists, the pipeline searches backwards through earlier segments (see the sketch below)
- First segment always requires YOLO detection
- Missing detections can be recovered in later segments

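A minimal sketch of that backwards search (file layout assumed from the segment structure described above; the real helper may differ in name and details):

```python
import os

def find_previous_mask(output_dir: str, segment_idx: int):
    """Walk backwards through earlier segments until a saved mask.png is found."""
    for idx in range(segment_idx - 1, -1, -1):
        candidate = os.path.join(output_dir, f"segment_{idx}", "mask.png")
        if os.path.exists(candidate):
            return candidate
    return None  # caller must fall back to YOLO detection
```
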
## Model Support

### YOLO Models
- **Detection**: yolov8n.pt, yolov8s.pt, yolov8m.pt (bounding boxes only)
- **Segmentation**: yolov8n-seg.pt, yolov8s-seg.pt (direct mask output)

### SAM2 Models
- **Tiny**: sam2.1_hiera_tiny.pt (fastest, lowest quality)
- **Small**: sam2.1_hiera_small.pt
- **Base+**: sam2.1_hiera_base_plus.pt
- **Large**: sam2.1_hiera_large.pt (best quality, slowest)

## Key Implementation Details

### GPU Optimization
- CUDA device selection with MPS fallback
- CuPy for GPU-accelerated mask operations
- NVENC hardware encoding support
- Batch processing where possible

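A minimal sketch of the device-selection logic described above (standard PyTorch calls; the exact fallback order used in this repo is assumed):

```python
import torch

def select_device() -> torch.device:
    """Prefer CUDA, fall back to Apple MPS, then CPU."""
    if torch.cuda.is_available():
        return torch.device("cuda")
    if getattr(torch.backends, "mps", None) is not None and torch.backends.mps.is_available():
        return torch.device("mps")
    return torch.device("cpu")
```
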
### Memory Management
- Segments processed sequentially to limit memory usage
- Explicit garbage collection between segments
- Low-resolution inference with high-resolution rendering
- Configurable scale factors for different stages

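Between segments the pipeline frees memory explicitly. A typical sketch of that step (the exact calls used in this repo are assumed):

```python
import gc
import torch

def release_segment_memory() -> None:
    """Free Python and GPU memory after a segment completes."""
    gc.collect()
    if torch.cuda.is_available():
        torch.cuda.empty_cache()
```
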
### Error Handling
- Graceful fallback when masks are unavailable
- Segment-level recovery (individual segments can be restarted)
- Comprehensive logging at all stages
- Status checking and cleanup utilities

## Debugging Features

### Status Monitoring
```bash
python main.py --config config.yaml --status
```

### Segment Cleanup
```bash
python main.py --config config.yaml --cleanup-segment 5
```

### Debug Outputs
- `yolo_debug.jpg`: Bounding box visualizations
- `first_frame_detection.jpg`: Initial mask visualization
- `mask.png`: Final segment mask for continuity
- `yolo_detections`: Saved detection coordinates

## Common Issues and Solutions

### No Right Eye Detections in VR180
- Lower the `yolo_confidence` threshold (try 0.3-0.4)
- Enable debug mode to analyze detection confidence
- Check whether the person is actually visible in the right eye view

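The retry behaviour added for stereo processing follows the same idea. A simplified sketch built on `detect_humans_in_frame`'s `confidence_override` parameter (the real retry happens per eye inside `detect_and_match_stereo_pairs` and uses `confidence_reduction_factor` from the config):

```python
def detect_with_retry(detector, frame, base_confidence: float, reduction_factor: float = 0.8):
    """Try detection at the configured confidence, then once more at a reduced threshold."""
    detections = detector.detect_humans_in_frame(frame, confidence_override=base_confidence)
    if not detections:
        detections = detector.detect_humans_in_frame(
            frame, confidence_override=base_confidence * reduction_factor
        )
    return detections
```
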
### Mask Propagation Failures
- Ensure the first segment has successful YOLO detections
- Check that the previous segment's `mask.png` exists
- Consider re-running YOLO on problem segments

### Memory Issues
- Reduce `inference_scale` (try 0.25)
- Use smaller models (tiny/small variants)
- Process fewer segments at once

## Development Notes

### Adding Features
- All core modules inherit from base classes in `core/`
- Configuration is centralized through `ConfigLoader`
- Logging uses Python's standard logging module
- File operations go through `utils/file_utils.py`

### Testing Components
- Each module can be tested independently
- Use the `--status` flag to check processing state
- Debug outputs help verify each stage

### Performance Tuning
- Adjust `inference_scale` for speed vs. quality
- Use `detect_segments` to run YOLO only on key segments
- Enable `use_nvenc` for hardware encoding
- Consider `vos_optimized` mode for SAM2 (experimental)

## Original Monolithic Script

The project includes the original working script in `spec.md` (lines 200-811) as a reference implementation. This script works but processes videos monolithically. The current modular architecture maintains the same core logic while adding:

- Better error handling and recovery
- Configurable processing pipeline
- Debug and monitoring capabilities
- Cleaner code organization

config.yaml
@@ -1,75 +1,137 @@
 # YOLO + SAM2 Video Processing Configuration
+# This file serves as a complete reference for all available settings.

 input:
+  # Full path to the input video file.
   video_path: "/path/to/input/video.mp4"

 output:
+  # Directory where all output files and segments will be stored.
   directory: "/path/to/output/"
+  # Filename for the final assembled video.
   filename: "processed_video.mp4"

 processing:
-  # Duration of each video segment in seconds
+  # Duration of each video segment in seconds. Shorter segments use less memory.
   segment_duration: 5

-  # Scale factor for SAM2 inference (0.5 = half resolution)
+  # Scale factor for SAM2 inference (e.g., 0.5 = half resolution).
+  # Lower values are faster but may reduce mask quality.
   inference_scale: 0.5

-  # YOLO detection confidence threshold
+  # YOLO detection confidence threshold (0.0 to 1.0).
   yolo_confidence: 0.6

-  # Which segments to run YOLO detection on
-  # Options: "all", [0, 5, 10], or [] for default (all)
+  # Which segments to run YOLO detection on.
+  # Options: "all", a list of specific segment indices (e.g., [0, 10, 20]), or [] for default ("all").
   detect_segments: "all"

-  # VR180 separate eye processing mode (default: false for backward compatibility)
+  # --- VR180 Stereo Processing ---
+  # Enables special logic for VR180 SBS video. When false, video is treated as a single view.
   separate_eye_processing: false

-  # Enable full greenscreen fallback when no humans detected (only used with separate_eye_processing)
+  # Threshold for stereo mask agreement (Intersection over Union).
+  # A value of 0.5 means masks must overlap by 50% to be considered a pair.
+  stereo_iou_threshold: 0.5
+
+  # Factor to reduce YOLO confidence by if no stereo pairs are found on the first try (e.g., 0.8 = 20% reduction).
+  confidence_reduction_factor: 0.8
+
+  # If no humans are detected in a segment, create a full green screen video.
+  # Only used when separate_eye_processing is true.
   enable_greenscreen_fallback: true

-  # Pixel overlap between left/right eyes for blending (optional, default: 0)
+  # Pixel overlap between left/right eyes for smoother blending at the center seam.
   eye_overlap_pixels: 0

 models:
-  # YOLO detection mode: "detection" (bounding boxes) or "segmentation" (direct masks)
-  yolo_mode: "segmentation" # Default: existing behavior, Options: "detection", "segmentation"
+  # YOLO mode: "detection" (for bounding boxes) or "segmentation" (for direct masks).
+  # "segmentation" is generally recommended as it provides initial masks to SAM2.
+  yolo_mode: "segmentation"

-  # YOLO model paths for different modes
-  yolo_detection_model: "models/yolo/yolo11l.pt" # Regular YOLO for detection mode
-  yolo_segmentation_model: "models/yolo/yolo11x-seg.pt" # Segmentation YOLO for segmentation mode
+  # Path to the YOLO model for "detection" mode.
+  yolo_detection_model: "models/yolo/yolo11l.pt"
+  # Path to the YOLO model for "segmentation" mode.
+  yolo_segmentation_model: "models/yolo/yolo11x-seg.pt"

-  # SAM2 model configuration
+  # --- SAM2 Model Configuration ---
   sam2_checkpoint: "models/sam2/checkpoints/sam2.1_hiera_small.pt"
   sam2_config: "models/sam2/configs/sam2.1/sam2.1_hiera_s.yaml"
+  # (Experimental) Use optimized VOS predictor for a significant speedup. Requires PyTorch 2.5.1+.
+  sam2_vos_optimized: false

 video:
-  # Use NVIDIA hardware encoding (requires NVENC-capable GPU)
+  # Use NVIDIA's NVENC for hardware-accelerated video encoding.
   use_nvenc: true

-  # Output video bitrate
+  # Bitrate for the output video (e.g., "25M", "50M").
   output_bitrate: "50M"

-  # Preserve original audio track
+  # If true, the audio track from the input video will be copied to the final output.
   preserve_audio: true

-  # Force keyframes for better segment boundaries
+  # Force keyframes at the start of each segment for clean cuts. Recommended to keep true.
   force_keyframes: true

 advanced:
-  # Green screen color (RGB values)
+  # RGB color for the green screen background.
   green_color: [0, 255, 0]

-  # Blue screen color for second object (RGB values)
+  # RGB color for the second object's mask (typically the right eye in VR180).
   blue_color: [255, 0, 0]

-  # YOLO human class ID (0 for COCO person class)
+  # The class ID for humans in the YOLO model (COCO default is 0 for "person").
   human_class_id: 0

-  # GPU memory management
+  # If true, deletes intermediate files like segment videos after processing.
   cleanup_intermediate_files: true

-  # Logging level (DEBUG, INFO, WARNING, ERROR)
+  # Logging level: DEBUG, INFO, WARNING, ERROR.
   log_level: "INFO"

-  # Save debug frames with YOLO detections visualized
+  # If true, saves debug images for YOLO detections.
   save_yolo_debug_frames: true

+  # --- Mid-Segment Re-detection ---
+  # Re-run YOLO at intervals within a segment to correct tracking drift.
+  enable_mid_segment_detection: false
+  redetection_interval: 30 # Frames between re-detections.
+  max_redetections_per_segment: 10
+
+  # --- Parallel Processing Optimizations ---
+  # (Experimental) Generate low-res videos for upcoming segments in the background.
+  enable_background_lowres_generation: false
+  max_concurrent_lowres: 2 # Max parallel FFmpeg processes.
+  lowres_segments_ahead: 2 # How many segments to prepare in advance.
+  use_ffmpeg_lowres: true # Use FFmpeg (faster) instead of OpenCV for low-res creation.
+
+# --- Mask Quality Enhancement Settings ---
+# These settings allow fine-tuning of the final mask appearance.
+# Enabling these may increase processing time.
+mask_processing:
+  # Edge feathering and blurring for smoother transitions.
+  enable_edge_blur: true
+  edge_blur_radius: 3
+  edge_blur_sigma: 0.5
+
+  # Temporal smoothing to reduce mask flickering between frames.
+  enable_temporal_smoothing: false
+  temporal_blend_weight: 0.2
+  temporal_history_frames: 2
+
+  # Clean up small noise and holes in the mask.
+  # Generally not needed when using SAM2, as its masks are high quality.
+  enable_morphological_cleaning: false
+  morphology_kernel_size: 5
+  min_component_size: 500
+
+  # Method for blending the mask edge with the background.
+  # Options: "linear" (fastest), "gaussian", "sigmoid".
+  alpha_blending_mode: "linear"
+  alpha_transition_width: 1
+
+  # Advanced edge-preserving smoothing filter. Slower but can produce higher quality edges.
+  enable_bilateral_filter: false
+  bilateral_d: 9
+  bilateral_sigma_color: 75
+  bilateral_sigma_space: 75
core/config_loader.py
@@ -185,3 +185,11 @@ class ConfigLoader:
     def should_cleanup_intermediate_files(self) -> bool:
         """Get whether to cleanup intermediate files."""
         return self.config.get('advanced', {}).get('cleanup_intermediate_files', True)
+
+    def get_stereo_iou_threshold(self) -> float:
+        """Get the IOU threshold for stereo mask agreement."""
+        return self.config['processing'].get('stereo_iou_threshold', 0.5)
+
+    def get_confidence_reduction_factor(self) -> float:
+        """Get the factor to reduce YOLO confidence by on retry."""
+        return self.config['processing'].get('confidence_reduction_factor', 0.8)
core/sam2_processor.py
@@ -237,13 +237,21 @@ class SAM2Processor:
             # Fallback to synchronous creation
             try:
+                logger.info(f"Creating low-res video synchronously: {input_video_path} -> {output_video_path}")
                 self.create_low_res_video(input_video_path, output_video_path, scale)
-                return os.path.exists(output_video_path) and os.path.getsize(output_video_path) > 0
+
+                if os.path.exists(output_video_path) and os.path.getsize(output_video_path) > 0:
+                    logger.info(f"Successfully created low-res video: {output_video_path} ({os.path.getsize(output_video_path)} bytes)")
+                    return True
+                else:
+                    logger.error(f"Low-res video creation failed - file doesn't exist or is empty: {output_video_path}")
+                    return False
             except Exception as e:
                 logger.error(f"Failed to create low-res video {output_video_path}: {e}")
                 return False

-    def add_yolo_prompts_to_predictor(self, inference_state, prompts: List[Dict[str, Any]]) -> bool:
+    def add_yolo_prompts_to_predictor(self, inference_state, prompts: List[Dict[str, Any]],
+                                      inference_scale: float = 1.0) -> bool:
         """
         Add YOLO detection prompts to SAM2 predictor.
         Includes error handling matching the working spec.md implementation.
@@ -251,6 +259,7 @@ class SAM2Processor:
         Args:
             inference_state: SAM2 inference state
             prompts: List of prompt dictionaries with obj_id and bbox
+            inference_scale: Scale factor to apply to bounding boxes

         Returns:
             True if prompts were added successfully
@@ -268,14 +277,20 @@ class SAM2Processor:
             bbox = prompt['bbox']
             confidence = prompt.get('confidence', 'unknown')

-            logger.info(f"SAM2 Debug: Adding prompt {i+1}/{len(prompts)}: Object {obj_id}, bbox={bbox}, conf={confidence}")
+            # Scale bounding box for SAM2 inference resolution
+            scaled_bbox = bbox * inference_scale
+
+            logger.info(f"SAM2 Debug: Adding prompt {i+1}/{len(prompts)}: Object {obj_id}")
+            logger.info(f"  Original bbox: {bbox}")
+            logger.info(f"  Scaled bbox (scale={inference_scale}): {scaled_bbox}")
+            logger.info(f"  Confidence: {confidence}")

             try:
                 _, out_obj_ids, out_mask_logits = self.predictor.add_new_points_or_box(
                     inference_state=inference_state,
                     frame_idx=0,
                     obj_id=obj_id,
-                    box=bbox.astype(np.float32),
+                    box=scaled_bbox.astype(np.float32),
                 )

                 logger.info(f"SAM2 Debug: ✓ Successfully added Object {obj_id} - returned obj_ids: {out_obj_ids}")
@@ -443,7 +458,7 @@ class SAM2Processor:

         # Add prompts or previous masks
         if yolo_prompts:
-            if not self.add_yolo_prompts_to_predictor(inference_state, yolo_prompts):
+            if not self.add_yolo_prompts_to_predictor(inference_state, yolo_prompts, inference_scale):
                 return None
         elif previous_masks:
             if not self.add_previous_masks_to_predictor(inference_state, previous_masks):
@@ -583,7 +598,7 @@ class SAM2Processor:
             inference_state = self.predictor.init_state(video_path=temp_video_path, async_loading_frames=True)

             # Add prompts
-            if not self.add_yolo_prompts_to_predictor(inference_state, prompts):
+            if not self.add_yolo_prompts_to_predictor(inference_state, prompts, inference_scale):
                 logger.error("Failed to add prompts for first frame debug")
                 return False

@@ -798,7 +813,7 @@ class SAM2Processor:
                 eye_prompt['obj_id'] = 1  # Always use obj_id=1 for single eye
                 eye_prompts.append(eye_prompt)

-            if not self.add_yolo_prompts_to_predictor(inference_state, eye_prompts):
+            if not self.add_yolo_prompts_to_predictor(inference_state, eye_prompts, inference_scale):
                 logger.error(f"Failed to add prompts for {eye_side} eye")
                 return None

core/yolo_detector.py
@@ -61,26 +61,36 @@ class YOLODetector:
             logger.error(f"Failed to load YOLO model: {e}")
             raise

-    def detect_humans_in_frame(self, frame: np.ndarray) -> List[Dict[str, Any]]:
+    def detect_humans_in_frame(self, frame: np.ndarray, confidence_override: Optional[float] = None,
+                               validate_with_detection: bool = False) -> List[Dict[str, Any]]:
         """
         Detect humans in a single frame using YOLO.

         Args:
             frame: Input frame (BGR format from OpenCV)
+            confidence_override: Optional confidence to use instead of the default
+            validate_with_detection: If True and in segmentation mode, validate masks against detection bboxes

         Returns:
             List of human detection dictionaries with bbox, confidence, and optionally masks
         """
         # Run YOLO detection/segmentation
-        results = self.model(frame, conf=self.confidence_threshold, verbose=False)
+        confidence = confidence_override if confidence_override is not None else self.confidence_threshold
+        results = self.model(frame, conf=confidence, verbose=False)

         human_detections = []

         # Process results
-        for result in results:
+        for result_idx, result in enumerate(results):
             boxes = result.boxes
             masks = result.masks if hasattr(result, 'masks') and result.masks is not None else None

+            logger.debug(f"YOLO Result {result_idx}: boxes={boxes is not None}, masks={masks is not None}")
+            if boxes is not None:
+                logger.debug(f"  Found {len(boxes)} total boxes")
+            if masks is not None:
+                logger.debug(f"  Found {len(masks.data)} total masks")

             if boxes is not None:
                 for i, box in enumerate(boxes):
                     # Get class ID
@@ -101,18 +111,30 @@ class YOLODetector:

                         # Extract mask if available (segmentation mode)
                         if masks is not None and i < len(masks.data):
-                            mask_data = masks.data[i].cpu().numpy()  # Get mask for this detection
+                            # Resize the raw mask to match the input frame dimensions
+                            raw_mask = masks.data[i].cpu().numpy()
+                            resized_mask = cv2.resize(raw_mask, (frame.shape[1], frame.shape[0]), interpolation=cv2.INTER_NEAREST)
+
+                            mask_area = np.sum(resized_mask > 0.5)
                             detection['has_mask'] = True
-                            detection['mask'] = mask_data
-                            logger.debug(f"YOLO Segmentation: Detected human with mask - conf={conf:.2f}, mask_shape={mask_data.shape}")
+                            detection['mask'] = resized_mask
+                            logger.info(f"YOLO Segmentation: Human {len(human_detections)} - conf={conf:.3f}, raw_mask_shape={raw_mask.shape}, frame_shape={frame.shape}, resized_mask_shape={resized_mask.shape}, mask_area={mask_area}px")
                         else:
-                            logger.debug(f"YOLO Detection: Detected human with bbox - conf={conf:.2f}, bbox={coords}")
+                            logger.debug(f"YOLO Detection: Human {len(human_detections)} - conf={conf:.3f}, bbox={coords} (no mask)")

                         human_detections.append(detection)
+                    else:
+                        logger.debug(f"YOLO: Skipping non-human detection (class {cls})")

         if self.supports_segmentation:
             masks_found = sum(1 for d in human_detections if d['has_mask'])
             logger.info(f"YOLO Segmentation: Found {len(human_detections)} humans, {masks_found} with masks")
+
+            # Optional validation with detection model
+            if validate_with_detection and masks_found > 0:
+                logger.info("Validating segmentation masks with detection model...")
+                validated_detections = self._validate_masks_with_detection(frame, human_detections, confidence_override)
+                return validated_detections
         else:
             logger.debug(f"YOLO Detection: Found {len(human_detections)} humans with bounding boxes")

@@ -1029,3 +1051,507 @@ class YOLODetector:
|
|||||||
except Exception as e:
|
except Exception as e:
|
||||||
logger.error(f"Error creating {eye_side} eye debug frame: {e}")
|
logger.error(f"Error creating {eye_side} eye debug frame: {e}")
|
||||||
return False
|
return False
|
||||||
|
|
||||||
|
def _calculate_iou(self, mask1: np.ndarray, mask2: np.ndarray) -> float:
|
||||||
|
"""Calculate Intersection over Union for two masks of the same size."""
|
||||||
|
if mask1.shape != mask2.shape:
|
||||||
|
return 0.0
|
||||||
|
|
||||||
|
intersection = np.logical_and(mask1, mask2).sum()
|
||||||
|
union = np.logical_or(mask1, mask2).sum()
|
||||||
|
|
||||||
|
return intersection / union if union > 0 else 0.0
|
||||||
|
|
||||||
|
def _calculate_stereo_similarity(self, left_mask: np.ndarray, right_mask: np.ndarray,
|
||||||
|
left_bbox: np.ndarray, right_bbox: np.ndarray,
|
||||||
|
left_idx: int = -1, right_idx: int = -1) -> float:
|
||||||
|
"""
|
||||||
|
Calculate stereo similarity for VR180 masks using spatial and size features.
|
||||||
|
For VR180, left and right eye views won't overlap much, so we use other metrics.
|
||||||
|
"""
|
||||||
|
logger.info(f" Starting similarity calculation L{left_idx} vs R{right_idx}")
|
||||||
|
logger.info(f" Left mask: shape={left_mask.shape}, dtype={left_mask.dtype}, min={left_mask.min()}, max={left_mask.max()}")
|
||||||
|
logger.info(f" Right mask: shape={right_mask.shape}, dtype={right_mask.dtype}, min={right_mask.min()}, max={right_mask.max()}")
|
||||||
|
logger.info(f" Left bbox: {left_bbox}")
|
||||||
|
logger.info(f" Right bbox: {right_bbox}")
|
||||||
|
if left_mask.shape != right_mask.shape:
|
||||||
|
logger.info(f" L{left_idx} vs R{right_idx}: Shape mismatch - {left_mask.shape} vs {right_mask.shape} - attempting to resize")
|
||||||
|
|
||||||
|
# Try to resize the smaller mask to match the larger one
|
||||||
|
if left_mask.size < right_mask.size:
|
||||||
|
left_mask = cv2.resize(left_mask.astype(np.float32), (right_mask.shape[1], right_mask.shape[0]), interpolation=cv2.INTER_NEAREST)
|
||||||
|
left_mask = left_mask > 0.5
|
||||||
|
logger.info(f" Resized left mask to {left_mask.shape}")
|
||||||
|
else:
|
||||||
|
right_mask = cv2.resize(right_mask.astype(np.float32), (left_mask.shape[1], left_mask.shape[0]), interpolation=cv2.INTER_NEAREST)
|
||||||
|
right_mask = right_mask > 0.5
|
||||||
|
logger.info(f" Resized right mask to {right_mask.shape}")
|
||||||
|
|
||||||
|
if left_mask.shape != right_mask.shape:
|
||||||
|
logger.warning(f" L{left_idx} vs R{right_idx}: Still shape mismatch after resize - {left_mask.shape} vs {right_mask.shape}")
|
||||||
|
return 0.0
|
||||||
|
|
||||||
|
# 1. Size similarity (area ratio)
|
||||||
|
left_area = np.sum(left_mask)
|
||||||
|
right_area = np.sum(right_mask)
|
||||||
|
|
||||||
|
if left_area == 0 or right_area == 0:
|
||||||
|
logger.debug(f" L{left_idx} vs R{right_idx}: Zero area - left={left_area}, right={right_area}")
|
||||||
|
return 0.0
|
||||||
|
|
||||||
|
area_ratio = min(left_area, right_area) / max(left_area, right_area)
|
||||||
|
|
||||||
|
# 2. Vertical position similarity (y-coordinates should be similar)
|
||||||
|
left_center_y = (left_bbox[1] + left_bbox[3]) / 2
|
||||||
|
right_center_y = (right_bbox[1] + right_bbox[3]) / 2
|
||||||
|
|
||||||
|
height = left_mask.shape[0]
|
||||||
|
y_diff = abs(left_center_y - right_center_y) / height
|
||||||
|
y_similarity = max(0, 1.0 - y_diff * 2) # Penalize vertical misalignment
|
||||||
|
|
||||||
|
# 3. Height similarity (bounding box heights should be similar)
|
||||||
|
left_height = left_bbox[3] - left_bbox[1]
|
||||||
|
right_height = right_bbox[3] - right_bbox[1]
|
||||||
|
|
||||||
|
if left_height == 0 or right_height == 0:
|
||||||
|
height_ratio = 0.0
|
||||||
|
else:
|
||||||
|
height_ratio = min(left_height, right_height) / max(left_height, right_height)
|
||||||
|
|
||||||
|
# 4. Aspect ratio similarity
|
||||||
|
left_width = left_bbox[2] - left_bbox[0]
|
||||||
|
right_width = right_bbox[2] - right_bbox[0]
|
||||||
|
|
||||||
|
if left_width == 0 or right_width == 0 or left_height == 0 or right_height == 0:
|
||||||
|
aspect_similarity = 0.0
|
||||||
|
else:
|
||||||
|
left_aspect = left_width / left_height
|
||||||
|
right_aspect = right_width / right_height
|
||||||
|
aspect_diff = abs(left_aspect - right_aspect) / max(left_aspect, right_aspect)
|
||||||
|
aspect_similarity = max(0, 1.0 - aspect_diff)
|
||||||
|
|
||||||
|
# Combine metrics with weights
|
||||||
|
similarity = (
|
||||||
|
area_ratio * 0.3 + # 30% weight on size similarity
|
||||||
|
y_similarity * 0.4 + # 40% weight on vertical alignment
|
||||||
|
height_ratio * 0.2 + # 20% weight on height similarity
|
||||||
|
aspect_similarity * 0.1 # 10% weight on aspect ratio
|
||||||
|
)
|
||||||
|
|
||||||
|
# Detailed logging for each comparison
|
||||||
|
logger.info(f" L{left_idx} vs R{right_idx}: area_ratio={area_ratio:.3f} (L={left_area}px, R={right_area}px), "
|
||||||
|
f"y_sim={y_similarity:.3f} (L_y={left_center_y:.1f}, R_y={right_center_y:.1f}, diff={y_diff:.3f}), "
|
||||||
|
f"height_ratio={height_ratio:.3f} (L_h={left_height:.1f}, R_h={right_height:.1f}), "
|
||||||
|
f"aspect_sim={aspect_similarity:.3f} (L_asp={left_aspect:.2f}, R_asp={right_aspect:.2f}), "
|
||||||
|
f"FINAL_SIMILARITY={similarity:.3f}")
|
||||||
|
|
||||||
|
return similarity
|
||||||
|
|
||||||
|
def _find_matching_mask_pairs(self, left_masks: List[Dict[str, Any]], right_masks: List[Dict[str, Any]],
|
||||||
|
similarity_threshold: float) -> Tuple[List[Dict[str, Any]], List[Dict[str, Any]], List[Dict[str, Any]]]:
|
||||||
|
"""Find the best matching pairs of masks between left and right eyes using stereo similarity."""
|
||||||
|
|
||||||
|
logger.info(f"Starting stereo mask matching with {len(left_masks)} left masks and {len(right_masks)} right masks.")
|
||||||
|
|
||||||
|
if not left_masks or not right_masks:
|
||||||
|
return [], left_masks, right_masks
|
||||||
|
|
||||||
|
# 1. Calculate all similarity scores for every possible pair
|
||||||
|
possible_pairs = []
|
||||||
|
logger.info("--- Calculating all possible stereo similarity pairs ---")
|
||||||
|
|
||||||
|
# First, log details about each mask
|
||||||
|
logger.info(f"LEFT EYE MASKS ({len(left_masks)} total):")
|
||||||
|
for i, left_detection in enumerate(left_masks):
|
||||||
|
bbox = left_detection['bbox']
|
||||||
|
mask_area = np.sum(left_detection['mask'])
|
||||||
|
conf = left_detection['confidence']
|
||||||
|
logger.info(f" L{i}: bbox=[{bbox[0]:.1f},{bbox[1]:.1f},{bbox[2]:.1f},{bbox[3]:.1f}], area={mask_area}px, conf={conf:.3f}")
|
||||||
|
|
||||||
|
logger.info(f"RIGHT EYE MASKS ({len(right_masks)} total):")
|
||||||
|
for j, right_detection in enumerate(right_masks):
|
||||||
|
bbox = right_detection['bbox']
|
||||||
|
mask_area = np.sum(right_detection['mask'])
|
||||||
|
conf = right_detection['confidence']
|
||||||
|
logger.info(f" R{j}: bbox=[{bbox[0]:.1f},{bbox[1]:.1f},{bbox[2]:.1f},{bbox[3]:.1f}], area={mask_area}px, conf={conf:.3f}")
|
||||||
|
|
||||||
|
logger.info("--- Stereo Similarity Calculations ---")
|
||||||
|
for i, left_detection in enumerate(left_masks):
|
||||||
|
for j, right_detection in enumerate(right_masks):
|
||||||
|
try:
|
||||||
|
# Use stereo similarity instead of IOU for VR180
|
||||||
|
similarity = self._calculate_stereo_similarity(
|
||||||
|
left_detection['mask'], right_detection['mask'],
|
||||||
|
left_detection['bbox'], right_detection['bbox'],
|
||||||
|
left_idx=i, right_idx=j
|
||||||
|
)
|
||||||
|
|
||||||
|
if similarity > similarity_threshold:
|
||||||
|
possible_pairs.append({'left_idx': i, 'right_idx': j, 'similarity': similarity})
|
||||||
|
logger.info(f" ✓ L{i} vs R{j}: ABOVE THRESHOLD ({similarity:.4f} > {similarity_threshold:.4f})")
|
||||||
|
else:
|
||||||
|
logger.info(f" ✗ L{i} vs R{j}: BELOW THRESHOLD ({similarity:.4f} <= {similarity_threshold:.4f})")
|
||||||
|
except Exception as e:
|
||||||
|
logger.error(f" ERROR L{i} vs R{j}: Exception in similarity calculation: {e}")
|
||||||
|
similarity = 0.0
|
||||||
|
|
||||||
|
# 2. Sort pairs by similarity score in descending order to prioritize the best matches
|
||||||
|
possible_pairs.sort(key=lambda x: x['similarity'], reverse=True)
|
||||||
|
|
||||||
|
logger.debug("--- Sorted similarity pairs above threshold ---")
|
||||||
|
for pair in possible_pairs:
|
||||||
|
logger.debug(f" Pair (L{pair['left_idx']}, R{pair['right_idx']}) - Similarity: {pair['similarity']:.4f}")
|
||||||
|
|
||||||
|
matched_pairs = []
|
||||||
|
matched_left_indices = set()
|
||||||
|
matched_right_indices = set()
|
||||||
|
|
||||||
|
# 3. Iterate through sorted pairs and greedily select the best available ones
|
||||||
|
logger.debug("--- Selecting best pairs ---")
|
||||||
|
for pair in possible_pairs:
|
||||||
|
left_idx, right_idx = pair['left_idx'], pair['right_idx']
|
||||||
|
|
||||||
|
if left_idx not in matched_left_indices and right_idx not in matched_right_indices:
|
||||||
|
logger.info(f" MATCH FOUND: (L{left_idx}, R{right_idx}) with Similarity {pair['similarity']:.4f}")
|
||||||
|
matched_pairs.append({
|
||||||
|
'left_mask': left_masks[left_idx],
|
||||||
|
'right_mask': right_masks[right_idx],
|
||||||
|
'similarity': pair['similarity'] # Changed from 'iou' to 'similarity'
|
||||||
|
})
|
||||||
|
matched_left_indices.add(left_idx)
|
||||||
|
matched_right_indices.add(right_idx)
|
||||||
|
else:
|
||||||
|
logger.debug(f" Skipping pair (L{left_idx}, R{right_idx}) because one mask is already matched.")
|
||||||
|
|
||||||
|
# 4. Identify unmatched (orphan) masks
|
||||||
|
unmatched_left = [mask for i, mask in enumerate(left_masks) if i not in matched_left_indices]
|
||||||
|
unmatched_right = [mask for i, mask in enumerate(right_masks) if i not in matched_right_indices]
|
||||||
|
|
||||||
|
logger.info(f"Matching complete: Found {len(matched_pairs)} pairs. Left orphans: {len(unmatched_left)}, Right orphans: {len(unmatched_right)}.")
|
||||||
|
|
||||||
|
return matched_pairs, unmatched_left, unmatched_right
|
||||||
|
|
||||||
|
def _save_stereo_agreement_debug_frame(self, left_frame: np.ndarray, right_frame: np.ndarray,
|
||||||
|
left_detections: List[Dict[str, Any]], right_detections: List[Dict[str, Any]],
|
||||||
|
matched_pairs: List[Dict[str, Any]], unmatched_left: List[Dict[str, Any]],
|
||||||
|
unmatched_right: List[Dict[str, Any]], output_path: str, title: str):
|
||||||
|
"""Save a debug frame visualizing the stereo mask agreement process."""
|
||||||
|
try:
|
||||||
|
# Create a combined image
|
||||||
|
h, w, _ = left_frame.shape
|
||||||
|
combined_frame = np.hstack((left_frame, right_frame))
|
||||||
|
|
||||||
|
def get_centroid(mask):
|
||||||
|
m = cv2.moments(mask.astype(np.uint8), binaryImage=True)
|
||||||
|
return (int(m["m10"] / m["m00"]), int(m["m01"] / m["m00"])) if m["m00"] != 0 else (0,0)
|
||||||
|
|
||||||
|
def draw_label(frame, text, pos, color):
|
||||||
|
# Draw a black background rectangle
|
||||||
|
cv2.rectangle(frame, (pos[0], pos[1] - 14), (pos[0] + len(text) * 8, pos[1] + 5), (0,0,0), -1)
|
||||||
|
# Draw the text
|
||||||
|
cv2.putText(frame, text, pos, cv2.FONT_HERSHEY_SIMPLEX, 0.5, color, 1)
|
||||||
|
|
||||||
|
# --- Draw ALL Masks First (to ensure every mask gets a label) ---
|
||||||
|
logger.info(f"Debug Frame: Drawing {len(left_detections)} left masks and {len(right_detections)} right masks")
|
||||||
|
|
||||||
|
# Draw all left detections first
|
||||||
|
for i, detection in enumerate(left_detections):
|
||||||
|
mask = detection['mask']
|
||||||
|
mask_area = np.sum(mask > 0.5)
|
||||||
|
|
||||||
|
# Skip tiny masks that are likely noise
|
||||||
|
if mask_area < 100: # Less than 100 pixels
|
||||||
|
logger.debug(f"Skipping tiny left mask L{i} with area {mask_area}px")
|
||||||
|
continue
|
||||||
|
|
||||||
|
contours, _ = cv2.findContours(mask.astype(np.uint8), cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
|
||||||
|
if contours:
|
||||||
|
cv2.drawContours(combined_frame, contours, -1, (0, 0, 255), 2) # Default red for unmatched
|
||||||
|
c = get_centroid(mask)
|
||||||
|
if c[0] > 0 and c[1] > 0: # Valid centroid
|
||||||
|
draw_label(combined_frame, f"L{i}", c, (0, 0, 255))
|
||||||
|
logger.debug(f"Drew left mask L{i} at centroid {c}, area={mask_area}px")
|
||||||
|
|
||||||
|
# Draw all right detections
|
||||||
|
for i, detection in enumerate(right_detections):
|
||||||
|
mask = detection['mask']
|
||||||
|
mask_area = np.sum(mask > 0.5)
|
||||||
|
|
||||||
|
# Skip tiny masks that are likely noise
|
||||||
|
if mask_area < 100: # Less than 100 pixels
|
||||||
|
logger.debug(f"Skipping tiny right mask R{i} with area {mask_area}px")
|
||||||
|
continue
|
||||||
|
|
||||||
|
contours, _ = cv2.findContours(mask.astype(np.uint8), cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
|
||||||
|
if contours:
|
||||||
|
for cnt in contours:
|
||||||
|
cnt[:, :, 0] += w
|
||||||
|
cv2.drawContours(combined_frame, contours, -1, (0, 0, 255), 2) # Default red for unmatched
|
||||||
|
c_shifted = get_centroid(mask)
|
||||||
|
c = (c_shifted[0] + w, c_shifted[1])
|
||||||
|
if c[0] > w and c[1] > 0: # Valid centroid in right half
|
||||||
|
draw_label(combined_frame, f"R{i}", c, (0, 0, 255))
|
||||||
|
logger.debug(f"Drew right mask R{i} at centroid {c}, area={mask_area}px")
|
||||||
|
|
||||||
|
# --- Now Overdraw Matched Pairs in Green ---
|
||||||
|
for pair in matched_pairs:
|
||||||
|
left_mask = pair['left_mask']['mask']
|
||||||
|
right_mask = pair['right_mask']['mask']
|
||||||
|
|
||||||
|
# Find the indices from the stored pair data (should be available from matching)
|
||||||
|
left_idx = None
|
||||||
|
right_idx = None
|
||||||
|
|
||||||
|
# Find indices by comparing mask properties
|
||||||
|
for i, det in enumerate(left_detections):
|
||||||
|
if (np.array_equal(det['bbox'], pair['left_mask']['bbox']) and
|
||||||
|
abs(det['confidence'] - pair['left_mask']['confidence']) < 0.001):
|
||||||
|
left_idx = i
|
||||||
|
break
|
||||||
|
|
||||||
|
for i, det in enumerate(right_detections):
|
||||||
|
if (np.array_equal(det['bbox'], pair['right_mask']['bbox']) and
|
||||||
|
abs(det['confidence'] - pair['right_mask']['confidence']) < 0.001):
|
||||||
|
right_idx = i
|
||||||
|
break
|
||||||
|
|
||||||
|
# Draw left mask in green (matched)
|
||||||
|
contours, _ = cv2.findContours(left_mask.astype(np.uint8), cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
|
||||||
|
if contours:
|
||||||
|
cv2.drawContours(combined_frame, contours, -1, (0, 255, 0), 3) # Thicker green line
|
||||||
|
c1 = get_centroid(left_mask)
|
||||||
|
if c1[0] > 0 and c1[1] > 0:
|
||||||
|
draw_label(combined_frame, f"L{left_idx if left_idx is not None else '?'}", c1, (0, 255, 0))
|
||||||
|
|
||||||
|
# Draw right mask in green (matched)
|
||||||
|
contours, _ = cv2.findContours(right_mask.astype(np.uint8), cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
|
||||||
|
if contours:
|
||||||
|
for cnt in contours:
|
||||||
|
cnt[:, :, 0] += w
|
||||||
|
cv2.drawContours(combined_frame, contours, -1, (0, 255, 0), 3) # Thicker green line
|
||||||
|
c2_shifted = get_centroid(right_mask)
|
||||||
|
c2 = (c2_shifted[0] + w, c2_shifted[1])
|
||||||
|
if c2[0] > w and c2[1] > 0:
|
||||||
|
draw_label(combined_frame, f"R{right_idx if right_idx is not None else '?'}", c2, (0, 255, 0))
|
||||||
|
|
||||||
|
# Draw line connecting centroids and similarity score
|
||||||
|
cv2.line(combined_frame, c1, c2, (0, 255, 0), 2)
|
||||||
|
similarity_text = f"Sim: {pair.get('similarity', pair.get('iou', 0)):.2f}"
|
||||||
|
cv2.putText(combined_frame, similarity_text, (c1[0] + 10, c1[1] + 20), cv2.FONT_HERSHEY_SIMPLEX, 0.6, (255, 255, 255), 2)
|
||||||
|
|
||||||
|
# Add title
|
||||||
|
cv2.putText(combined_frame, title, (10, 30), cv2.FONT_HERSHEY_SIMPLEX, 1.0, (255, 255, 255), 2)
|
||||||
|
|
||||||
|
cv2.imwrite(output_path, combined_frame)
|
||||||
|
logger.info(f"Saved stereo agreement debug frame to {output_path}")
|
||||||
|
except Exception as e:
|
||||||
|
logger.error(f"Failed to create stereo agreement debug frame: {e}")
|
||||||
|
|
||||||
|
def detect_and_match_stereo_pairs(self, frame: np.ndarray, confidence_reduction_factor: float,
|
||||||
|
stereo_similarity_threshold: float, segment_info: dict, save_debug_frames: bool) -> List[Dict[str, Any]]:
|
||||||
|
"""The main method to detect and match stereo mask pairs."""
|
||||||
|
frame_height, frame_width, _ = frame.shape
|
||||||
|
half_width = frame_width // 2
|
||||||
|
|
||||||
|
left_eye_frame = frame[:, :half_width]
|
||||||
|
right_eye_frame = frame[:, half_width:half_width*2] # Ensure exact same width
|
||||||
|
|
||||||
|
logger.info(f"VR180 Frame Split: Original={frame.shape}, Left={left_eye_frame.shape}, Right={right_eye_frame.shape}")
|
||||||
|
|
||||||
|
# Initial detection with validation
|
||||||
|
logger.info(f"Running initial stereo detection at {self.confidence_threshold} confidence.")
|
||||||
|
left_detections = self.detect_humans_in_frame(left_eye_frame, validate_with_detection=True)
|
||||||
|
right_detections = self.detect_humans_in_frame(right_eye_frame, validate_with_detection=True)
|
||||||
|
|
||||||
|
# Convert IOU threshold to similarity threshold (IOU 0.5 ≈ similarity 0.3)
|
||||||
|
similarity_threshold = max(0.2, stereo_similarity_threshold * 0.6)
|
||||||
|
matched_pairs, unmatched_left, unmatched_right = self._find_matching_mask_pairs(left_detections, right_detections, similarity_threshold)
|
||||||
|
|
||||||
|
if save_debug_frames:
|
||||||
|
debug_path = os.path.join(segment_info['directory'], "yolo_stereo_agreement_initial.jpg")
|
||||||
|
title = f"Initial Attempt (Conf: {self.confidence_threshold:.2f}) - {len(matched_pairs)} Pairs"
|
||||||
|
self._save_stereo_agreement_debug_frame(left_eye_frame, right_eye_frame, left_detections, right_detections, matched_pairs, unmatched_left, unmatched_right, debug_path, title)
|
||||||
|
|
||||||
|
# Retry with lower confidence if no pairs found
|
||||||
|
if not matched_pairs:
|
||||||
|
new_confidence = self.confidence_threshold * confidence_reduction_factor
|
||||||
|
logger.info(f"No valid pairs found. Reducing confidence to {new_confidence:.2f} and retrying.")
|
||||||
|
|
||||||
|
left_detections = self.detect_humans_in_frame(left_eye_frame, confidence_override=new_confidence, validate_with_detection=True)
|
||||||
|
right_detections = self.detect_humans_in_frame(right_eye_frame, confidence_override=new_confidence, validate_with_detection=True)
|
||||||
|
|
||||||
|
matched_pairs, unmatched_left, unmatched_right = self._find_matching_mask_pairs(left_detections, right_detections, similarity_threshold)
|
||||||
|
|
||||||
|
if save_debug_frames:
|
||||||
|
debug_path = os.path.join(segment_info['directory'], "yolo_stereo_agreement_retry.jpg")
|
||||||
|
title = f"Retry Attempt (Conf: {new_confidence:.2f}) - {len(matched_pairs)} Pairs"
|
||||||
|
self._save_stereo_agreement_debug_frame(left_eye_frame, right_eye_frame, left_detections, right_detections, matched_pairs, unmatched_left, unmatched_right, debug_path, title)
|
||||||
|
|
||||||
|
# Prepare final results - convert to full-frame coordinates and masks
|
||||||
|
final_prompts = []
|
||||||
|
if matched_pairs:
|
||||||
|
logger.info(f"Found {len(matched_pairs)} valid stereo pairs.")
|
||||||
|
for i, pair in enumerate(matched_pairs):
|
||||||
|
# Convert eye-specific coordinates and masks to full-frame
|
||||||
|
left_bbox_full_frame, left_mask_full_frame = self._convert_eye_to_full_frame(
|
||||||
|
pair['left_mask']['bbox'], pair['left_mask']['mask'],
|
||||||
|
'left', frame_width, frame_height
|
||||||
|
)
|
||||||
|
|
||||||
|
right_bbox_full_frame, right_mask_full_frame = self._convert_eye_to_full_frame(
|
||||||
|
pair['right_mask']['bbox'], pair['right_mask']['mask'],
|
||||||
|
'right', frame_width, frame_height
|
||||||
|
)
|
||||||
|
|
||||||
|
logger.info(f"Stereo Pair {i}: Left bbox {pair['left_mask']['bbox']} -> {left_bbox_full_frame}")
|
||||||
|
logger.info(f"Stereo Pair {i}: Right bbox {pair['right_mask']['bbox']} -> {right_bbox_full_frame}")
|
||||||
|
|
||||||
|
# Create prompts for SAM2 with full-frame coordinates and masks
|
||||||
|
final_prompts.append({
|
||||||
|
'obj_id': i * 2 + 1,
|
||||||
|
'bbox': left_bbox_full_frame,
|
||||||
|
'mask': left_mask_full_frame
|
||||||
|
})
|
||||||
|
final_prompts.append({
|
||||||
|
'obj_id': i * 2 + 2,
|
||||||
|
'bbox': right_bbox_full_frame,
|
||||||
|
'mask': right_mask_full_frame
|
||||||
|
})
|
||||||
|
else:
|
||||||
|
logger.warning("No valid stereo pairs found after all attempts.")
|
||||||
|
|
||||||
|
return final_prompts
|
||||||
|
|
||||||
|
def _convert_eye_to_full_frame(self, eye_bbox: np.ndarray, eye_mask: np.ndarray,
|
||||||
|
eye_side: str, full_frame_width: int, full_frame_height: int) -> tuple:
|
||||||
|
"""
|
||||||
|
Convert eye-specific bounding box and mask to full-frame coordinates.
|
||||||
|
|
||||||
|
Args:
|
||||||
|
eye_bbox: Bounding box in eye coordinate system
|
||||||
|
eye_mask: Mask in eye coordinate system
|
||||||
|
eye_side: 'left' or 'right'
|
||||||
|
full_frame_width: Width of the full VR180 frame
|
||||||
|
full_frame_height: Height of the full VR180 frame
|
||||||
|
|
||||||
|
Returns:
|
||||||
|
Tuple of (full_frame_bbox, full_frame_mask)
|
||||||
|
"""
|
||||||
|
half_width = full_frame_width // 2
|
||||||
|
|
||||||
|
# Convert bounding box coordinates
|
||||||
|
full_frame_bbox = eye_bbox.copy()
|
||||||
|
|
||||||
|
if eye_side == 'right':
|
||||||
|
# Shift right eye coordinates by half_width
|
||||||
|
full_frame_bbox[0] += half_width # x1
|
||||||
|
full_frame_bbox[2] += half_width # x2
|
||||||
|
|
||||||
|
# Create full-frame mask
|
||||||
|
full_frame_mask = np.zeros((full_frame_height, full_frame_width), dtype=eye_mask.dtype)
|
||||||
|
|
||||||
|
if eye_side == 'left':
|
||||||
|
# Place left eye mask in left half
|
||||||
|
eye_height, eye_width = eye_mask.shape
|
||||||
|
target_height = min(eye_height, full_frame_height)
|
||||||
|
target_width = min(eye_width, half_width)
|
||||||
|
full_frame_mask[:target_height, :target_width] = eye_mask[:target_height, :target_width]
|
||||||
|
else: # right
|
||||||
|
# Place right eye mask in right half
|
||||||
|
eye_height, eye_width = eye_mask.shape
|
||||||
|
target_height = min(eye_height, full_frame_height)
|
||||||
|
target_width = min(eye_width, half_width)
|
||||||
|
            full_frame_mask[:target_height, half_width:half_width+target_width] = eye_mask[:target_height, :target_width]

        logger.debug(f"Converted {eye_side} eye: bbox {eye_bbox} -> {full_frame_bbox}, "
                     f"mask {eye_mask.shape} -> {full_frame_mask.shape}, "
                     f"mask_pixels: {np.sum(eye_mask > 0.5)} -> {np.sum(full_frame_mask > 0.5)}")

        return full_frame_bbox, full_frame_mask

    def _validate_masks_with_detection(self, frame: np.ndarray, segmentation_detections: List[Dict[str, Any]],
                                       confidence_override: Optional[float] = None) -> List[Dict[str, Any]]:
        """
        Validate segmentation masks by checking if they overlap with detection bounding boxes.
        This helps filter out spurious mask regions that aren't actually humans.
        """
        if not hasattr(self, '_detection_model'):
            # Load detection model for validation
            try:
                detection_model_path = self.model_path.replace('-seg.pt', '.pt')  # Try to find detection version
                if not os.path.exists(detection_model_path):
                    detection_model_path = "yolo11l.pt"  # Fallback to default

                logger.info(f"Loading detection model for validation: {detection_model_path}")
                self._detection_model = YOLO(detection_model_path)
            except Exception as e:
                logger.warning(f"Could not load detection model for validation: {e}")
                return segmentation_detections

        # Run detection model
        confidence = confidence_override if confidence_override is not None else self.confidence_threshold
        detection_results = self._detection_model(frame, conf=confidence, verbose=False)

        # Extract detection bounding boxes
        detection_bboxes = []
        for result in detection_results:
            if result.boxes is not None:
                for box in result.boxes:
                    cls = int(box.cls.cpu().numpy()[0])
                    if cls == self.human_class_id:
                        coords = box.xyxy[0].cpu().numpy()
                        conf = float(box.conf.cpu().numpy()[0])
                        detection_bboxes.append({'bbox': coords, 'confidence': conf})

        logger.info(f"Validation: Found {len(detection_bboxes)} detection bboxes vs {len(segmentation_detections)} segmentation masks")

        # Validate each segmentation mask against detection bboxes
        validated_detections = []
        for seg_det in segmentation_detections:
            if not seg_det['has_mask']:
                validated_detections.append(seg_det)
                continue

            # Check if this mask overlaps significantly with any detection bbox
            mask = seg_det['mask']
            seg_bbox = seg_det['bbox']

            best_overlap = 0.0
            best_detection = None

            for det_bbox_info in detection_bboxes:
                det_bbox = det_bbox_info['bbox']
                overlap = self._calculate_bbox_overlap(seg_bbox, det_bbox)
                if overlap > best_overlap:
                    best_overlap = overlap
                    best_detection = det_bbox_info

            if best_overlap > 0.3:  # 30% overlap threshold
                logger.info(f"Validation: Segmentation mask validated (overlap={best_overlap:.3f} with detection conf={best_detection['confidence']:.3f})")
                validated_detections.append(seg_det)
            else:
                mask_area = np.sum(mask > 0.5)
                logger.warning(f"Validation: Rejecting segmentation mask with low overlap ({best_overlap:.3f}) - area={mask_area}px")

        logger.info(f"Validation: Kept {len(validated_detections)}/{len(segmentation_detections)} segmentation masks")
        return validated_detections

    def _calculate_bbox_overlap(self, bbox1: np.ndarray, bbox2: np.ndarray) -> float:
        """Calculate the overlap ratio between two bounding boxes."""
        # Calculate intersection
        x1 = max(bbox1[0], bbox2[0])
        y1 = max(bbox1[1], bbox2[1])
        x2 = min(bbox1[2], bbox2[2])
        y2 = min(bbox1[3], bbox2[3])

        if x2 <= x1 or y2 <= y1:
            return 0.0

        intersection = (x2 - x1) * (y2 - y1)

        # Calculate areas
        area1 = (bbox1[2] - bbox1[0]) * (bbox1[3] - bbox1[1])
        area2 = (bbox2[2] - bbox2[0]) * (bbox2[3] - bbox2[1])

        # Return intersection over smaller area (more lenient than IoU)
        return intersection / min(area1, area2) if min(area1, area2) > 0 else 0.0
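The `intersection / min(area1, area2)` ratio above is deliberately more forgiving than IoU: a tight segmentation box fully contained in a looser detection box scores 1.0 even when the IoU is modest. A worked example with hypothetical boxes:

```python
import numpy as np

seg_bbox = np.array([100, 100, 200, 300])   # tight segmentation box, area 100 * 200 = 20,000 px
det_bbox = np.array([ 80,  90, 260, 320])   # looser detection box, area 180 * 230 = 41,400 px

# Intersection is the whole segmentation box here: 100 * 200 = 20,000 px
# overlap-over-min-area = 20000 / 20000 = 1.0        -> easily passes the 0.3 threshold
# IoU would be           20000 / 41400 ≈ 0.48        -> much closer to a typical cutoff
```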
198
main.py
@@ -681,138 +681,41 @@ async def main_async():
 previous_masks = None

 if use_detections:
-# Run YOLO detection on current segment
-logger.info(f"Running YOLO detection on segment {segment_idx}")
-detection_file = os.path.join(segment_info['directory'], "yolo_detections")
-
-# Check if detection already exists
-if os.path.exists(detection_file):
-logger.info(f"Loading existing YOLO detections for segment {segment_idx}")
-detections = detector.load_detections_from_file(detection_file)
-else:
-# Run YOLO detection on first frame
-detections = detector.detect_humans_in_video_first_frame(
-segment_info['video_file'],
-scale=config.get_inference_scale()
-)
-# Save detections for future runs
-detector.save_detections_to_file(detections, detection_file)
-
-if detections:
-total_humans_detected += len(detections)
-logger.info(f"Found {len(detections)} humans in segment {segment_idx}")
-
-# Get frame width from video
+# Run YOLO stereo detection and matching on current segment
+logger.info(f"Running stereo pair detection on segment {segment_idx}")
+
+# Load the first frame for detection
 cap = cv2.VideoCapture(segment_info['video_file'])
-frame_width = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))
+ret, frame = cap.read()
 cap.release()

-yolo_prompts = detector.convert_detections_to_sam2_prompts(
-detections, frame_width
+if not ret:
+logger.error(f"Could not read first frame of segment {segment_idx}")
+continue
+
+# Scale frame if needed
+if config.get_inference_scale() != 1.0:
+frame = cv2.resize(frame, None, fx=config.get_inference_scale(), fy=config.get_inference_scale(), interpolation=cv2.INTER_LINEAR)
+
+yolo_prompts = detector.detect_and_match_stereo_pairs(
+frame,
+config.get_confidence_reduction_factor(),
+config.get_stereo_iou_threshold(),
+segment_info,
+config.get('advanced.save_yolo_debug_frames', True)
 )

-# If no right eye detections found, run debug analysis with lower confidence
-half_frame_width = frame_width // 2
-right_eye_detections = [d for d in detections if (d['bbox'][0] + d['bbox'][2]) / 2 >= half_frame_width]
-
-if len(right_eye_detections) == 0 and config.get('advanced.save_yolo_debug_frames', False):
-logger.info(f"VR180 Debug: No right eye detections found, running lower confidence analysis...")
-
-# Load first frame for debug analysis
-cap = cv2.VideoCapture(segment_info['video_file'])
-ret, debug_frame = cap.read()
-cap.release()
-
-if ret:
-# Scale frame to match detection scale
-if config.get_inference_scale() != 1.0:
-scale = config.get_inference_scale()
-debug_frame = cv2.resize(debug_frame, None, fx=scale, fy=scale, interpolation=cv2.INTER_LINEAR)
-
-# Run debug detection with lower confidence
-debug_detections = detector.debug_detect_with_lower_confidence(debug_frame, debug_confidence=0.3)
-
-# Analyze where these lower confidence detections are
-debug_right_eye = [d for d in debug_detections if (d['bbox'][0] + d['bbox'][2]) / 2 >= half_frame_width]
-
-if len(debug_right_eye) > 0:
-logger.warning(f"VR180 Debug: Found {len(debug_right_eye)} right eye detections with lower confidence!")
-for i, det in enumerate(debug_right_eye):
-logger.warning(f"VR180 Debug: Right eye detection {i+1}: conf={det['confidence']:.3f}, bbox={det['bbox']}")
-logger.warning(f"VR180 Debug: Consider lowering yolo_confidence from {config.get_yolo_confidence()} to 0.3-0.4")
+if not yolo_prompts:
+logger.warning(f"No valid stereo pairs found for segment {segment_idx}. Attempting to use previous segment's mask.")
+if segment_idx > 0:
+prev_segment_dir = segments_info[segment_idx - 1]['directory']
+previous_masks = sam2_processor.load_previous_segment_mask(prev_segment_dir)
+if previous_masks:
+logger.info(f"Using masks from segment {segment_idx - 1} as fallback.")
 else:
-logger.info(f"VR180 Debug: No right eye detections found even with confidence 0.3")
-logger.info(f"VR180 Debug: This confirms person is not visible in right eye view")
-
-logger.info(f"Pipeline Debug: Segment {segment_idx} - Generated {len(yolo_prompts)} SAM2 prompts from {len(detections)} YOLO detections")
-
-# Save debug frame with detections visualized (if enabled)
-if config.get('advanced.save_yolo_debug_frames', False):
-debug_frame_path = os.path.join(segment_info['directory'], "yolo_debug.jpg")
-
-# Load first frame for debug visualization
-cap = cv2.VideoCapture(segment_info['video_file'])
-ret, debug_frame = cap.read()
-cap.release()
-
-if ret:
-# Scale frame to match detection scale
-if config.get_inference_scale() != 1.0:
-scale = config.get_inference_scale()
-debug_frame = cv2.resize(debug_frame, None, fx=scale, fy=scale, interpolation=cv2.INTER_LINEAR)
-
-detector.save_debug_frame_with_detections(debug_frame, detections, debug_frame_path, yolo_prompts)
+logger.error(f"Fallback failed: No previous mask found for segment {segment_idx}.")
 else:
-logger.warning(f"Could not load frame for debug visualization in segment {segment_idx}")
-
-# Check if we have YOLO masks for debug visualization
-has_yolo_masks = False
-if detections and detector.supports_segmentation:
-has_yolo_masks = any(d.get('has_mask', False) for d in detections)
-
-# Generate first frame masks debug (SAM2 or YOLO)
-first_frame_debug_path = os.path.join(segment_info['directory'], "first_frame_detection.jpg")
-
-if has_yolo_masks:
-logger.info(f"Pipeline Debug: Generating YOLO first frame masks for segment {segment_idx}")
-# Create YOLO mask debug visualization
-create_yolo_mask_debug_frame(detections, segment_info['video_file'], first_frame_debug_path, config.get_inference_scale())
-else:
-logger.info(f"Pipeline Debug: Generating SAM2 first frame masks for segment {segment_idx}")
-sam2_processor.generate_first_frame_debug_masks(
-segment_info['video_file'],
-yolo_prompts,
-first_frame_debug_path,
-config.get_inference_scale()
-)
-else:
-logger.warning(f"No humans detected in segment {segment_idx}")
-
-# Save debug frame even when no detections (if enabled)
-if config.get('advanced.save_yolo_debug_frames', False):
-debug_frame_path = os.path.join(segment_info['directory'], "yolo_debug_no_detections.jpg")
-
-# Load first frame for debug visualization
-cap = cv2.VideoCapture(segment_info['video_file'])
-ret, debug_frame = cap.read()
-cap.release()
-
-if ret:
-# Scale frame to match detection scale
-if config.get_inference_scale() != 1.0:
-scale = config.get_inference_scale()
-debug_frame = cv2.resize(debug_frame, None, fx=scale, fy=scale, interpolation=cv2.INTER_LINEAR)
-
-# Add "No detections" text overlay
-cv2.putText(debug_frame, "YOLO: No humans detected",
-(10, 30),
-cv2.FONT_HERSHEY_SIMPLEX, 1.0,
-(0, 0, 255), 2)  # Red text
-
-cv2.imwrite(debug_frame_path, debug_frame)
-logger.info(f"Saved no-detection debug frame to {debug_frame_path}")
-else:
-logger.warning(f"Could not load frame for no-detection debug visualization in segment {segment_idx}")
+logger.error("Cannot use fallback for the first segment.")
 elif segment_idx > 0:
 # Try to load previous segment mask
 for j in range(segment_idx - 1, -1, -1):

@@ -826,43 +729,20 @@ async def main_async():
 logger.error(f"No prompts or previous masks available for segment {segment_idx}")
 continue

-# Check if we have YOLO masks and can skip SAM2 (recheck in case detections were loaded from file)
-if not 'has_yolo_masks' in locals():
-has_yolo_masks = False
-if detections and detector.supports_segmentation:
-has_yolo_masks = any(d.get('has_mask', False) for d in detections)
-
-if has_yolo_masks:
-logger.info(f"Pipeline Debug: YOLO segmentation provided masks - using as SAM2 initial masks for segment {segment_idx}")
-
-# Convert YOLO masks to initial masks for SAM2
-cap = cv2.VideoCapture(segment_info['video_file'])
-frame_width = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))
-frame_height = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))
-cap.release()
-
-# Convert YOLO masks to the format expected by SAM2 add_previous_masks_to_predictor
-yolo_masks_dict = {}
-for i, detection in enumerate(detections[:2]):  # Up to 2 objects
-if detection.get('has_mask', False):
-mask = detection['mask']
-# Resize mask to match inference scale
-if config.get_inference_scale() != 1.0:
-scale = config.get_inference_scale()
-scaled_height = int(frame_height * scale)
-scaled_width = int(frame_width * scale)
-mask = cv2.resize(mask.astype(np.float32), (scaled_width, scaled_height), interpolation=cv2.INTER_NEAREST)
-mask = mask > 0.5
-
-obj_id = i + 1  # Sequential object IDs
-yolo_masks_dict[obj_id] = mask.astype(bool)
-logger.info(f"Pipeline Debug: YOLO mask for Object {obj_id} - shape: {mask.shape}, pixels: {np.sum(mask)}")
-
-logger.info(f"Pipeline Debug: Using YOLO masks as SAM2 initial masks - {len(yolo_masks_dict)} objects")
-
-# Use traditional SAM2 pipeline with YOLO masks as initial masks
-previous_masks = yolo_masks_dict
-yolo_prompts = None  # Don't use bounding box prompts when we have masks
+# Check if we have YOLO masks from the stereo pair matching and can use them as initial masks for SAM2
+if yolo_prompts and detector.supports_segmentation:
+logger.info(f"Pipeline Debug: YOLO segmentation provided matched stereo masks - using as SAM2 initial masks.")
+# Convert the prompts (which contain masks) into the initial_masks format for SAM2
+initial_masks = {prompt['obj_id']: prompt['mask'] for prompt in yolo_prompts if 'mask' in prompt}
+
+if initial_masks:
+# We are providing initial masks, so we should not provide bbox prompts
+previous_masks = initial_masks
+yolo_prompts = None
+logger.info(f"Pipeline Debug: Using {len(previous_masks)} YOLO masks as SAM2 initial masks.")
+else:
+logger.warning("YOLO segmentation mode is on, but no masks were found in the final prompts.")

 # Debug what we're passing to SAM2
 if yolo_prompts:
198
sbs_spec.md
Normal file
@@ -0,0 +1,198 @@
# Plan: Separate Left/Right Eye Processing for VR180 SAM2 Pipeline

## Overview
Implement a new processing mode that splits VR180 side-by-side frames into separate left and right halves, processes each eye independently through SAM2, then recombines them into the final output. This should improve tracking accuracy by removing parallax confusion between eyes.

## Key Changes Required

### 1. Configuration Updates
**File: `config.yaml`**
- Add new configuration option: `processing.separate_eye_processing: false` (default off for backward compatibility)
- Add related options:
  - `processing.enable_greenscreen_fallback: true` (render full green if no humans detected)
  - `processing.eye_overlap_pixels: 0` (optional overlap for blending)

### 2. Core SAM2 Processor Enhancements
**File: `core/sam2_processor.py`**

#### New Methods:
- `split_frame_into_eyes(frame) -> (left_frame, right_frame)`
- `split_video_into_eyes(video_path, left_output, right_output, scale)`
- `process_single_eye_segment(segment_info, eye_side, yolo_prompts, previous_masks, inference_scale)`
- `combine_eye_masks(left_masks, right_masks, full_frame_shape) -> combined_masks` (see the sketch at the end of this section)
- `create_greenscreen_segment(segment_info, duration_seconds) -> bool`

#### Modified Methods:
- `process_single_segment()` - Add branch for separate eye processing mode
- New processing flow:
  1. Check if separate_eye_processing is enabled
  2. If enabled: split the segment video into left/right eye videos
  3. Process each eye independently with SAM2
  4. Combine masks back to full frame format
  5. If fallback needed: create full greenscreen segment
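Of the new methods, `split_frame_into_eyes` and `combine_eye_masks` carry most of the new logic. A minimal sketch of what they could look like, assuming boolean masks keyed by object ID, a split at the frame's horizontal midpoint, and distinct object IDs for the two eye views (the real implementations may differ):

```python
import numpy as np

def split_frame_into_eyes(frame: np.ndarray):
    """Split a side-by-side VR180 frame into (left_frame, right_frame)."""
    half_width = frame.shape[1] // 2
    return frame[:, :half_width], frame[:, half_width:]

def combine_eye_masks(left_masks, right_masks, full_frame_shape):
    """Paste per-eye masks back into full-frame masks.

    left_masks / right_masks are {obj_id: bool mask} dicts in eye coordinates (or None).
    Right-eye objects are re-numbered after the left-eye ones so the two views keep
    distinct object IDs in the combined result.
    """
    height, width = full_frame_shape[:2]
    half_width = width // 2
    combined = {}
    next_id = 1
    for eye_masks, x_offset in ((left_masks, 0), (right_masks, half_width)):
        for _, mask in sorted((eye_masks or {}).items()):
            full = np.zeros((height, width), dtype=bool)
            h, w = mask.shape[:2]
            full[:h, x_offset:x_offset + w] = mask
            combined[next_id] = full
            next_id += 1
    return combined
```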
### 3. YOLO Detector Enhancements
**File: `core/yolo_detector.py`**

#### New Methods:
- `detect_humans_in_single_eye(frame, eye_side) -> List[Dict]`
- `convert_eye_detections_to_sam2_prompts(detections, eye_side) -> List[Dict]`
- `has_any_detections(detections_list) -> bool`

#### Modified Methods:
- `detect_humans_in_video_first_frame()` - Add eye-specific detection support
- Object ID assignment: Always use obj_id=1 for single-eye processing (since each eye is processed independently)
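A sketch of how `convert_eye_detections_to_sam2_prompts` could implement the obj_id=1 rule above; the field names follow the detection dictionaries used elsewhere in the pipeline, and the exact prompt format is an assumption:

```python
def convert_eye_detections_to_sam2_prompts(detections, eye_side):
    """Turn per-eye YOLO detections into SAM2 prompts (sketch).

    Each eye runs in its own SAM2 session, so the plan's rule of always using
    obj_id=1 applies; eye_side is kept only for logging and debug output.
    """
    prompts = []
    for det in detections:
        prompts.append({
            'obj_id': 1,                          # single tracked object per eye session
            'bbox': det['bbox'],                  # [x1, y1, x2, y2] in eye-frame pixels
            'confidence': det.get('confidence', 0.0),
            'eye_side': eye_side,
        })
    return prompts
```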
### 4. Mask Processor Updates
**File: `core/mask_processor.py`**

#### New Methods:
- `create_full_greenscreen_frame(frame_shape) -> np.ndarray`
- `process_greenscreen_only_segment(segment_info, frame_count) -> bool`

#### Modified Methods:
- `apply_green_mask()` - Handle combined eye masks properly
- Add support for full-greenscreen fallback when no humans detected
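A minimal sketch of the greenscreen fallback frame, assuming the `advanced.green_color` value from config.yaml and 8-bit frames; the real method may also need to match the segment's exact resolution and frame count:

```python
import numpy as np

def create_full_greenscreen_frame(frame_shape, green_color=(0, 255, 0)):
    """Return a solid green frame; (0, 255, 0) is identical in RGB and BGR order."""
    height, width = frame_shape[:2]
    frame = np.zeros((height, width, 3), dtype=np.uint8)
    frame[:] = green_color
    return frame
```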
### 5. Main Pipeline Integration
**File: `main.py`**

#### Processing Flow Changes:
```python
# For each segment:
if config.get('processing.separate_eye_processing', False):
    # 1. Run YOLO on full frame to check for ANY human presence
    full_frame_detections = detector.detect_humans_in_video_first_frame(segment_video)

    if not full_frame_detections:
        # No humans detected anywhere - create full greenscreen segment
        success = mask_processor.process_greenscreen_only_segment(segment_info, expected_frame_count)
        continue

    # 2. Split detections by eye and process separately
    left_detections = [d for d in full_frame_detections if is_in_left_half(d, frame_width)]
    right_detections = [d for d in full_frame_detections if is_in_right_half(d, frame_width)]

    # 3. Process left eye (if detections exist)
    left_masks = None
    if left_detections:
        left_eye_prompts = detector.convert_eye_detections_to_sam2_prompts(left_detections, 'left')
        left_masks = sam2_processor.process_single_eye_segment(segment_info, 'left', left_eye_prompts, previous_left_masks, inference_scale)

    # 4. Process right eye (if detections exist)
    right_masks = None
    if right_detections:
        right_eye_prompts = detector.convert_eye_detections_to_sam2_prompts(right_detections, 'right')
        right_masks = sam2_processor.process_single_eye_segment(segment_info, 'right', right_eye_prompts, previous_right_masks, inference_scale)

    # 5. Combine masks back to full frame format
    if left_masks or right_masks:
        combined_masks = sam2_processor.combine_eye_masks(left_masks, right_masks, full_frame_shape)
        # Continue with normal mask processing...
    else:
        # Neither eye had trackable humans - full greenscreen fallback
        success = mask_processor.process_greenscreen_only_segment(segment_info, expected_frame_count)

else:
    # Original processing mode (current behavior)
    # ... existing logic unchanged
```
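The flow above references two helpers, `is_in_left_half` and `is_in_right_half`, that are not defined elsewhere in this plan. A minimal sketch, assuming each detection carries an `[x1, y1, x2, y2]` bbox in full-frame pixels and mirroring the half-frame center check already used in main.py:

```python
def bbox_center_x(detection):
    # Detections are assumed to carry an [x1, y1, x2, y2] bbox in full-frame pixels.
    x1, _, x2, _ = detection['bbox']
    return (x1 + x2) / 2.0

def is_in_left_half(detection, frame_width):
    # The left eye view occupies the left half of the side-by-side frame.
    return bbox_center_x(detection) < frame_width // 2

def is_in_right_half(detection, frame_width):
    return bbox_center_x(detection) >= frame_width // 2
```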
### 6. File Structure Changes

#### New Files:
- `core/eye_processor.py` - Dedicated class for eye-specific operations
- `utils/video_utils.py` - Video manipulation utilities (splitting, combining)
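For the splitting utility in `utils/video_utils.py`, one option is to lean on FFmpeg's crop filter, consistent with the pipeline's existing use of FFmpeg. A sketch, with output paths and encoder settings left to FFmpeg defaults (illustrative only):

```python
import subprocess

def split_video_into_eyes(video_path, left_output, right_output):
    """Crop a side-by-side video into separate left/right eye videos using FFmpeg."""
    crops = {
        left_output: "crop=iw/2:ih:0:0",      # left half of the frame
        right_output: "crop=iw/2:ih:iw/2:0",  # right half of the frame
    }
    for output_path, crop in crops.items():
        cmd = [
            "ffmpeg", "-y", "-i", video_path,
            "-vf", crop,
            "-an",  # eye videos are intermediates; audio is re-attached at assembly time
            output_path,
        ]
        subprocess.run(cmd, check=True)
```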
#### Modified Files:
- All core processing modules as detailed above
- Update logging to distinguish left/right eye processing
- Update debug frame generation for eye-specific visualization

### 7. Debug and Monitoring Enhancements

#### Debug Outputs:
- `left_eye_debug.jpg` - Left eye YOLO detections
- `right_eye_debug.jpg` - Right eye YOLO detections
- `left_eye_sam2_masks.jpg` - Left eye SAM2 results
- `right_eye_sam2_masks.jpg` - Right eye SAM2 results
- `combined_masks_debug.jpg` - Final combined result

#### Logging Enhancements:
- Clear distinction between left/right eye processing stages
- Performance metrics for each eye's processing
- Fallback trigger logging when no humans are detected

### 8. Performance Considerations

#### Optimizations:
- **Parallel Processing**: Process left and right eyes simultaneously using threading (see the sketch after this list)
- **Selective Processing**: Skip SAM2 for eyes with no YOLO detections
- **Memory Management**: Clean up intermediate eye videos promptly
- **Caching**: Cache split eye videos if processing multiple segments
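A sketch of the parallel-processing idea, assuming `process_single_eye_segment` can safely run in two threads (GPU memory headroom and thread safety would need to be verified):

```python
from concurrent.futures import ThreadPoolExecutor

def process_both_eyes(sam2_processor, segment_info, left_prompts, right_prompts,
                      previous_left_masks, previous_right_masks, inference_scale):
    """Run the left- and right-eye SAM2 passes concurrently; returns (left_masks, right_masks)."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        left_future = pool.submit(sam2_processor.process_single_eye_segment,
                                  segment_info, 'left', left_prompts,
                                  previous_left_masks, inference_scale)
        right_future = pool.submit(sam2_processor.process_single_eye_segment,
                                   segment_info, 'right', right_prompts,
                                   previous_right_masks, inference_scale)
        return left_future.result(), right_future.result()
```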
#### Resource Usage:
- **Memory**: ~2x peak usage during eye processing (temporary)
- **Storage**: Temporary left/right eye videos (~1.5x original size)
- **Compute**: Potentially faster overall due to smaller frame processing

### 9. Backward Compatibility

#### Default Behavior:
- `separate_eye_processing: false` by default
- Existing configurations work unchanged
- All current functionality preserved

#### Migration Path:
- Users can gradually test new mode on problematic segments
- Configuration flag allows easy A/B testing
- Existing debug outputs remain functional

### 10. Error Handling and Fallbacks

#### Robust Error Recovery:
- If eye splitting fails → fall back to original processing
- If single eye SAM2 fails → use greenscreen for that eye
- If both eyes fail → full greenscreen segment
- Comprehensive logging of all fallback triggers

#### Quality Validation:
- Verify combined masks have reasonable pixel counts
- Check for mask alignment issues between eyes
- Validate segment completeness before marking done
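A sketch of the pixel-count check above; the thresholds are illustrative, not tuned values:

```python
import numpy as np

def masks_look_reasonable(combined_masks, full_frame_shape, min_pixels=500, max_fraction=0.6):
    """Cheap sanity check before marking a segment done: reject empty or frame-filling masks."""
    frame_area = full_frame_shape[0] * full_frame_shape[1]
    for obj_id, mask in combined_masks.items():
        pixels = int(np.sum(mask))
        if pixels < min_pixels or pixels > max_fraction * frame_area:
            return False
    return True
```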
## Implementation Priority

### Phase 1 (Core Functionality)
1. Configuration schema updates
2. Basic eye splitting and recombining logic
3. Modified SAM2 processor with separate eye support
4. Greenscreen fallback implementation

### Phase 2 (Integration)
1. Main pipeline integration with the new processing mode
2. YOLO detector eye-specific enhancements
3. Mask processor updates for combined masks
4. Basic error handling and fallbacks

### Phase 3 (Polish)
1. Performance optimizations (parallel processing)
2. Enhanced debug outputs and logging
3. Comprehensive testing and validation
4. Documentation updates

## Expected Benefits

### Tracking Improvements:
- **Eliminated Parallax Confusion**: SAM2 processes a single viewpoint per eye
- **Better Object Consistency**: Single object tracking per eye view
- **Improved Temporal Coherence**: Less cross-eye interference
- **Reduced False Positives**: Eye-specific context for tracking

### Operational Benefits:
- **Graceful Degradation**: Full greenscreen when no humans are detected
- **Flexible Processing**: Can be enabled/disabled per pipeline run
- **Better Debug Visibility**: Eye-specific debug outputs
- **Performance Scalability**: Smaller frames = faster processing per eye

This plan maintains full backward compatibility while adding the requested separate eye processing capability with robust fallback mechanisms.
122
test-separate-eyes-config.yaml
Normal file
@@ -0,0 +1,122 @@
# YOLO + SAM2 Video Processing Configuration with VR180 Separate Eye Processing

input:
  video_path: "./input/regrets_full.mp4"

output:
  directory: "./output/"
  filename: "vr180_processed_both_eyes.mp4"

processing:
  # Duration of each video segment in seconds
  segment_duration: 5

  # Scale factor for SAM2 inference (0.5 = half resolution)
  inference_scale: 0.4

  # YOLO detection confidence threshold (lowered for better VR180 detection)
  yolo_confidence: 0.4

  # Which segments to run YOLO detection on
  detect_segments: "all"

  # VR180 separate eye processing mode
  separate_eye_processing: false

  # Minimum IoU between left- and right-eye masks to count as a stereo pair.
  # A value of 0.5 means masks must overlap by 50% to be considered a pair.
  stereo_iou_threshold: 0.5

  # Factor to reduce YOLO confidence by if no stereo pairs are found on the first try (e.g., 0.8 = 20% reduction).
  confidence_reduction_factor: 0.8

  # If no humans are detected in a segment, create a full green screen video.
  # Only used when separate_eye_processing is true.
  enable_greenscreen_fallback: true

  # Pixel overlap between left/right eyes for blending (0 = no overlap)
  eye_overlap_pixels: 0

models:
  # YOLO detection mode: "detection" (bounding boxes) or "segmentation" (direct masks)
  yolo_mode: "segmentation"  # Default: existing behavior, Options: "detection", "segmentation"

  # YOLO model paths for different modes
  yolo_detection_model: "models/yolo/yolo11l.pt"         # Regular YOLO for detection mode
  yolo_segmentation_model: "models/yolo/yolo11x-seg.pt"  # Segmentation YOLO for segmentation mode

  # SAM2 model configuration
  sam2_checkpoint: "models/sam2/checkpoints/sam2.1_hiera_small.pt"
  sam2_config: "models/sam2/configs/sam2.1/sam2.1_hiera_s.yaml"

video:
  # Use NVIDIA hardware encoding (requires NVENC-capable GPU)
  use_nvenc: true

  # Output video bitrate
  output_bitrate: "25M"

  # Preserve original audio track
  preserve_audio: true

  # Force keyframes for better segment boundaries
  force_keyframes: true

advanced:
  # Green screen color (RGB values)
  green_color: [0, 255, 0]

  # Blue screen color for second object (RGB values)
  blue_color: [255, 0, 0]

  # YOLO human class ID (0 for COCO person class)
  human_class_id: 0

  # GPU memory management
  cleanup_intermediate_files: true

  # Logging level (DEBUG, INFO, WARNING, ERROR)
  log_level: "INFO"

  # Save debug frames with YOLO detections visualized (ENABLED FOR TESTING)
  save_yolo_debug_frames: true

  # --- Mid-Segment Re-detection ---
  # Re-run YOLO at intervals within a segment to correct tracking drift.
  enable_mid_segment_detection: false
  redetection_interval: 30          # Frames between re-detections.
  max_redetections_per_segment: 10

  # Parallel Processing Optimizations
  enable_background_lowres_generation: false  # Enable async low-res video pre-generation (temporarily disabled due to syntax fix needed)
  max_concurrent_lowres: 2                    # Max parallel FFmpeg processes for low-res creation
  lowres_segments_ahead: 2                    # How many segments to prepare in advance
  use_ffmpeg_lowres: true                     # Use FFmpeg instead of OpenCV for low-res creation

# Mask Quality Enhancement Settings - Optimized for Performance
mask_processing:
  # Edge feathering and blurring (REDUCED for performance)
  enable_edge_blur: true   # Enable Gaussian blur on mask edges for smooth transitions
  edge_blur_radius: 3      # Reduced from 10 to 3 for better performance
  edge_blur_sigma: 0.5     # Gaussian blur standard deviation

  # Temporal smoothing between frames
  enable_temporal_smoothing: false  # Enable frame-to-frame mask blending
  temporal_blend_weight: 0.2        # Weight for previous frame (0.0-1.0, higher = more smoothing)
  temporal_history_frames: 2        # Number of previous frames to consider

  # Morphological mask cleaning (DISABLED for VR180 - SAM2 masks are already high quality)
  enable_morphological_cleaning: false  # Disabled for performance - SAM2 produces clean masks
  morphology_kernel_size: 5             # Kernel size for opening/closing operations
  min_component_size: 500               # Minimum pixel area for connected components

  # Alpha blending mode (OPTIMIZED)
  alpha_blending_mode: "linear"  # Linear is fastest - keep as-is
  alpha_transition_width: 1      # Width of transition zone in pixels

  # Advanced options
  enable_bilateral_filter: false  # Edge-preserving smoothing (slower but higher quality)
  bilateral_d: 9                  # Bilateral filter diameter
  bilateral_sigma_color: 75       # Bilateral filter color sigma
  bilateral_sigma_space: 75       # Bilateral filter space sigma