# YOLO + SAM2 Video Processing Pipeline
## Overview
This project provides an automated video processing pipeline that uses YOLO for human detection and SAM2 for precise segmentation to create green screen videos. The system processes long videos by splitting them into manageable segments, detecting and tracking humans in each segment, and then reassembling the processed segments into a final output video with preserved audio.
## Core Functionality
### Input
- **Long video file** (MP4 format, any duration)
- **Configuration file** (YAML format) specifying processing parameters
### Output
- **Processed video file** with humans visible and background replaced with green screen
- **Preserved audio** from the original input video
- **Intermediate files** for debugging and quality control
## Processing Pipeline
### 1. Video Segmentation
- Splits input video into configurable-duration segments (default: 5 seconds)
- Creates organized directory structure: `segment_0/`, `segment_1/`, etc.
- Each segment folder contains the segment video file
- Generates force keyframes for consistent encoding
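A minimal sketch of this splitting step, assuming FFmpeg is available on `PATH` and using the default 5-second `segment_duration`; the helper name, flags, and the per-segment folder layout shown are illustrative rather than the project's actual implementation:

```python
# Hypothetical sketch of the splitting step (not the repo's video_splitter module).
import os
import shutil
import subprocess

def split_video(input_path: str, base_dir: str, segment_duration: int = 5) -> None:
    os.makedirs(base_dir, exist_ok=True)
    pattern = os.path.join(base_dir, "segment_%03d.mp4")
    subprocess.run([
        "ffmpeg", "-y", "-i", input_path,
        # Force a keyframe at every segment boundary so cuts land on clean frames.
        "-force_key_frames", f"expr:gte(t,n_forced*{segment_duration})",
        "-f", "segment", "-segment_time", str(segment_duration),
        "-reset_timestamps", "1",
        pattern,
    ], check=True)
    # Move each segment file into its own segment_N/ folder, as the pipeline expects.
    for name in sorted(os.listdir(base_dir)):
        if name.startswith("segment_") and name.endswith(".mp4"):
            index = int(name.split("_")[1].split(".")[0])
            seg_dir = os.path.join(base_dir, f"segment_{index}")
            os.makedirs(seg_dir, exist_ok=True)
            shutil.move(os.path.join(base_dir, name), os.path.join(seg_dir, name))
```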
### 2. Human Detection & Tracking
- **YOLO Detection**: Automatically detects humans in keyframe segments using YOLOv8
- **SAM2 Segmentation**: Uses detected bounding boxes as prompts for precise mask generation
- **Mask Propagation**: Propagates masks across all frames in each segment
- **Stereo Video Support**: Handles VR/stereo content with left/right human assignment
- **Continuity**: Non-keyframe segments use previous segment masks for consistency
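A condensed sketch of the detection-to-segmentation handoff, adapted from the reference script appended at the end of this document (the stereo left/right object assignment and the previous-mask fallback are omitted for brevity; the helper name is illustrative):

```python
# Sketch: seed SAM2 with YOLO person boxes, then propagate masks through one segment.
import cv2
import torch
from ultralytics import YOLO
from sam2.build_sam import build_sam2_video_predictor

def seed_segment_with_yolo(segment_video, yolo_weights="yolov8n.pt",
                           sam2_cfg="configs/sam2.1/sam2.1_hiera_l.yaml",
                           sam2_ckpt="../checkpoints/sam2.1_hiera_large.pt"):
    device = "cuda" if torch.cuda.is_available() else "cpu"
    predictor = build_sam2_video_predictor(sam2_cfg, sam2_ckpt, device=device)
    state = predictor.init_state(video_path=segment_video)

    # Detect people (COCO class 0) on the first frame and use each box as a SAM2 prompt.
    cap = cv2.VideoCapture(segment_video)
    ok, first_frame = cap.read()
    cap.release()
    if not ok:
        raise RuntimeError(f"Could not read first frame of {segment_video}")
    detections = YOLO(yolo_weights)(first_frame, conf=0.6, classes=[0], verbose=False)[0]

    for obj_id, box in enumerate(detections.boxes.xyxy.cpu().numpy(), start=1):
        predictor.add_new_points_or_box(inference_state=state, frame_idx=0,
                                        obj_id=obj_id, box=box)

    # Propagate the seeded masks across every frame of the segment.
    return {
        frame_idx: {oid: (logits[i] > 0.0).cpu().numpy() for i, oid in enumerate(obj_ids)}
        for frame_idx, obj_ids, logits in predictor.propagate_in_video(state)
    }
```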
### 3. Green Screen Processing
- **Mask Application**: Applies generated masks to isolate humans
- **Background Replacement**: Replaces non-human areas with green screen (RGB: 0,255,0)
- **GPU Acceleration**: Uses CuPy for fast mask processing
- **Multi-resolution**: Low-res inference for speed, full-res final rendering
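A minimal CPU-only (NumPy) sketch of the compositing step; the reference script's `apply_green_mask` performs the same operation on the GPU with CuPy and also upscales low-resolution masks to the frame size first:

```python
import numpy as np

def composite_green_screen(frame: np.ndarray, masks: list) -> np.ndarray:
    """Keep pixels covered by any human mask; paint everything else green."""
    keep = np.zeros(frame.shape[:2], dtype=bool)
    for mask in masks:
        keep |= mask.astype(bool)
    out = np.empty_like(frame)
    out[:] = (0, 255, 0)       # green background (BGR order for OpenCV frames)
    out[keep] = frame[keep]    # copy the original pixels wherever a human was masked
    return out
```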
### 4. Video Assembly
- **Segment Concatenation**: Combines all processed segments into single video
- **Audio Preservation**: Copies original audio track to final output
- **Quality Maintenance**: Preserves original video quality and framerate
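A hedged sketch of the assembly step, assuming FFmpeg is on `PATH` and that each segment directory already contains the processed `output_<N>.mp4` produced by the pipeline; the helper name and the list-file layout are illustrative:

```python
# Hypothetical sketch of segment concatenation plus audio copy from the original input.
import os
import subprocess

def assemble(base_dir: str, original_video: str, final_output: str) -> None:
    # 1. List the processed segment files in order for ffmpeg's concat demuxer.
    list_path = os.path.join(base_dir, "segments.txt")
    seg_dirs = sorted(
        (d for d in os.listdir(base_dir) if d.startswith("segment_")),
        key=lambda d: int(d.split("_")[1]),
    )
    with open(list_path, "w") as f:
        for d in seg_dirs:
            index = d.split("_")[1]
            seg_file = os.path.join(base_dir, d, f"output_{index}.mp4")
            f.write(f"file '{seg_file}'\n")

    # 2. Concatenate video streams without re-encoding, then copy the original audio track.
    subprocess.run([
        "ffmpeg", "-y",
        "-f", "concat", "-safe", "0", "-i", list_path,   # concatenated green-screen video
        "-i", original_video,                            # original file, used only for audio
        "-map", "0:v:0", "-map", "1:a:0?",               # video from concat, audio from source
        "-c", "copy",
        final_output,
    ], check=True)
```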
## Key Features
### Automated Processing
- **No Manual Intervention**: Fully automated human detection eliminates manual point selection
- **Batch Processing**: Processes multiple segments efficiently
- **Smart Fallback**: Robust mask propagation with intelligent previous-segment loading
### Modular Architecture
- **Configuration-Driven**: YAML-based configuration for easy parameter adjustment
- **Extensible Design**: Modular structure allows for easy feature additions
- **Error Recovery**: Graceful handling of detection failures and missing segments
### Performance Optimizations
- **GPU Acceleration**: CUDA/NVENC support for faster processing
- **Memory Management**: Efficient handling of large videos through segmentation
- **Concurrent Processing**: Thread-safe operations where applicable
## Technical Stack
### Core Dependencies
- **SAM2**: Facebook's Segment Anything Model 2 for precise segmentation
- **YOLOv8 (Ultralytics)**: Human detection and bounding box generation
- **OpenCV**: Video processing and frame manipulation
- **CuPy**: GPU-accelerated array operations
- **FFmpeg**: Video encoding/decoding and audio handling
- **PyTorch**: Deep learning framework backend
### Supported Formats
- **Input Video**: MP4, AVI, MOV (any OpenCV-supported format)
- **Output Video**: MP4 with H.265/HEVC encoding
- **Audio**: Preserves original audio codec and quality
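Because input handling goes through OpenCV, a quick compatibility check is simply whether `cv2.VideoCapture` can open the file; this small probe (illustrative helper name) is one way to verify that and read the basic stream properties:

```python
import cv2

def probe_video(path: str) -> dict:
    """Return basic properties of a video, or raise if OpenCV cannot open it."""
    cap = cv2.VideoCapture(path)
    if not cap.isOpened():
        raise ValueError(f"OpenCV could not open {path}")
    info = {
        "width": int(cap.get(cv2.CAP_PROP_FRAME_WIDTH)),
        "height": int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT)),
        "fps": cap.get(cv2.CAP_PROP_FPS),
        "frames": int(cap.get(cv2.CAP_PROP_FRAME_COUNT)),
    }
    cap.release()
    return info
```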
## Configuration Options
### Video Processing
- `segment_duration`: Duration of each video segment (seconds)
- `inference_scale`: Scale factor for SAM2 inference (for speed)
- `output_scale`: Scale factor for final output
### Detection Parameters
- `yolo_model`: Path to YOLO model weights
- `yolo_confidence`: Detection confidence threshold
- `detect_segments`: Which segments to run YOLO detection on
### SAM2 Parameters
- `sam2_checkpoint`: Path to SAM2 model weights
- `sam2_config`: SAM2 model configuration file
### Output Options
- `use_nvenc`: Enable NVIDIA hardware encoding
- `output_bitrate`: Video bitrate for final output
- `preserve_audio`: Whether to copy audio track
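An illustrative sketch of how a configuration loader might merge these options with defaults (key names follow the example configuration later in this document; the default values shown are assumptions, except the 0.5 inference scale, 0.6 YOLO confidence, and 50M bitrate, which match the reference script):

```python
# Hypothetical config loader sketch; not necessarily core/config_loader.py's actual API.
import yaml

DEFAULTS = {
    "processing": {"segment_duration": 5, "inference_scale": 0.5,
                   "output_scale": 1.0, "yolo_confidence": 0.6,
                   "detect_segments": "all"},
    "output": {"use_nvenc": True, "output_bitrate": "50M", "preserve_audio": True},
}

def load_config(path: str) -> dict:
    with open(path) as f:
        user_cfg = yaml.safe_load(f) or {}
    # Shallow-merge user values over defaults, section by section.
    cfg = {section: {**values, **user_cfg.get(section, {})}
           for section, values in DEFAULTS.items()}
    # Sections without defaults (input, models) are passed through as-is.
    for section in ("input", "models"):
        if section in user_cfg:
            cfg[section] = user_cfg[section]
    return cfg
```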
## Directory Structure
```
new_yolo/
├── spec.md # This specification document
├── requirements.txt # Python dependencies
├── config.yaml # Default configuration file
├── main.py # Entry point script
├── core/
│ ├── __init__.py
│ ├── video_splitter.py # Video segmentation logic
│ ├── yolo_detector.py # YOLO human detection
│ ├── sam2_processor.py # SAM2 segmentation
│ ├── mask_processor.py # Mask application and green screen
│ ├── video_assembler.py # Final video assembly
│ └── config_loader.py # Configuration management
├── utils/
│ ├── __init__.py
│ ├── file_utils.py # File system operations
│ ├── video_utils.py # Video processing utilities
│ └── logging_utils.py # Logging configuration
└── examples/
├── basic_config.yaml # Example configuration
└── advanced_config.yaml # Advanced configuration options
```
## Usage Examples
### Basic Usage
```bash
python main.py --config config.yaml
```
### Custom Configuration
```bash
python main.py --config examples/advanced_config.yaml
```
### Configuration File Example
```yaml
input:
  video_path: "/path/to/input/video.mp4"
output:
  directory: "/path/to/output/"
  filename: "processed_video.mp4"
processing:
  segment_duration: 5
  inference_scale: 0.5
  yolo_confidence: 0.6
  detect_segments: "all"  # or [0, 5, 10]
models:
  yolo_model: "yolov8n.pt"
  sam2_checkpoint: "../checkpoints/sam2.1_hiera_large.pt"
  sam2_config: "configs/sam2.1/sam2.1_hiera_l.yaml"
```
## Use Cases
### Content Creation
- **VR/360 Video Processing**: Remove backgrounds from immersive content
- **Green Screen Production**: Automated background removal for video production
- **Social Media Content**: Quick background replacement for content creators
### Commercial Applications
- **Video Conferencing**: Real-time background replacement
- **E-learning**: Professional video production with clean backgrounds
- **Marketing**: Product demonstration videos with custom backgrounds
## Performance Considerations
### Hardware Requirements
- **GPU**: NVIDIA GPU with CUDA support (recommended)
- **RAM**: 16GB+ for processing large videos
- **Storage**: SSD recommended for temporary file operations
### Processing Time
- Approximately **1-2x real-time** on modern GPUs
- Scales with video resolution and segment count
- Memory usage remains constant regardless of input video length
## Future Enhancements
### Planned Features
- **Multi-object Tracking**: Support for multiple humans per frame
- **Custom Object Detection**: Configurable object classes beyond humans
- **Real-time Processing**: Live video stream support
- **Cloud Integration**: AWS/GCP processing support
- **Web Interface**: Browser-based configuration and monitoring
### Model Improvements
- **Fine-tuned YOLO**: Domain-specific human detection models
- **SAM2 Optimization**: Custom SAM2 checkpoints for video content
- **Temporal Consistency**: Enhanced cross-segment mask propagation
## Appendix: Original Monolithic Script
The following is the original monolithic script that this repository refactors into the modular structure described above. If something in this repository does not work, consult this script; it is known to work and can be used as a reference when debugging:
```python
import os
import cv2
import numpy as np
import cupy as cp
from concurrent.futures import ThreadPoolExecutor
import torch
import logging
import sys
import gc
from sam2.build_sam import build_sam2_video_predictor
import argparse
from ultralytics import YOLO

logger = logging.getLogger(__name__)
logging.basicConfig(level=logging.INFO)

os.environ["PYTORCH_ENABLE_MPS_FALLBACK"] = "1"

# Variables for input and output directories
SAM2_CHECKPOINT = "../checkpoints/sam2.1_hiera_large.pt"
MODEL_CFG = "configs/sam2.1/sam2.1_hiera_l.yaml"
GREEN = [0, 255, 0]
BLUE = [255, 0, 0]
INFERENCE_SCALE = 0.50
FULL_SCALE = 1.0

# YOLO model for human detection (class 0 = person)
YOLO_MODEL_PATH = "yolov8n.pt"  # You can change this to a custom model
YOLO_CONFIDENCE = 0.6
HUMAN_CLASS_ID = 0  # COCO class ID for person

def open_video(video_path):
    """
    Opens a video file and returns a generator that yields frames.
    Parameters:
    - video_path: Path to the video file.
    Returns:
    - A generator that yields frames from the video.
    """
    cap = cv2.VideoCapture(video_path)
    if not cap.isOpened():
        print(f"Error: Could not open video file {video_path}")
        return
    while True:
        ret, frame = cap.read()
        if not ret:
            break
        yield frame
    cap.release()


def load_previous_segment_mask(prev_segment_dir):
    mask_path = os.path.join(prev_segment_dir, "mask.png")
    mask_image = cv2.imread(mask_path)
    if mask_image is None:
        raise FileNotFoundError(f"Mask image not found at {mask_path}")
    # Ensure the mask_image has three color channels
    if len(mask_image.shape) != 3 or mask_image.shape[2] != 3:
        raise ValueError("Mask image does not have three color channels.")
    mask_image = mask_image.astype(np.uint8)
    # Extract Object A and Object B masks
    mask_a = np.all(mask_image == GREEN, axis=2)
    mask_b = np.all(mask_image == BLUE, axis=2)
    per_obj_input_mask = {1: mask_a, 2: mask_b}
    input_palette = None  # No palette needed for binary mask
    return per_obj_input_mask, input_palette


def apply_green_mask(frame, masks):
    # Convert frame and masks to CuPy arrays
    frame_gpu = cp.asarray(frame)
    combined_mask = cp.zeros(frame_gpu.shape[:2], dtype=cp.bool_)
    for mask in masks:
        mask_gpu = cp.asarray(mask.squeeze())
        if mask_gpu.shape != frame_gpu.shape[:2]:
            resized_mask = cv2.resize(cp.asnumpy(mask_gpu).astype(cp.float32),
                                      (frame_gpu.shape[1], frame_gpu.shape[0]))
            mask_gpu = cp.asarray(resized_mask > 0.5)  # Convert back to CuPy boolean array
        else:
            mask_gpu = mask_gpu.astype(cp.bool_)  # Ensure boolean type
        combined_mask |= mask_gpu  # Perform the bitwise OR operation
    green_background = cp.full(frame_gpu.shape, cp.array([0, 255, 0], dtype=cp.uint8), dtype=cp.uint8)
    result_frame = cp.where(combined_mask[..., None], frame_gpu, green_background)
    return cp.asnumpy(result_frame)  # Convert back to NumPy

def initialize_predictor():
    if torch.cuda.is_available():
        device = torch.device("cuda")
    elif torch.backends.mps.is_available():
        device = torch.device("mps")
        print(
            "\nSupport for MPS devices is preliminary. SAM 2 is trained with CUDA and might "
            "give numerically different outputs and sometimes degraded performance on MPS."
        )
        # Enable MPS fallback for operations not supported on MPS
        os.environ["PYTORCH_ENABLE_MPS_FALLBACK"] = "1"
    else:
        device = torch.device("cpu")
    logger.info(f"Using device: {device}")
    predictor = build_sam2_video_predictor(MODEL_CFG, SAM2_CHECKPOINT, device=device)
    return predictor


def load_first_frame(video_path, scale=1.0):
    """
    Opens a video file and returns the first frame, scaled as specified.
    Parameters:
    - video_path: Path to the video file.
    - scale: Scaling factor for the frame (default is 1.0 for original size).
    Returns:
    - first_frame: The first frame of the video, scaled accordingly.
    """
    cap = cv2.VideoCapture(video_path)
    if not cap.isOpened():
        logger.error(f"Error: Could not open video file {video_path}")
        return None
    ret, frame = cap.read()
    cap.release()
    if not ret:
        logger.error(f"Error: Could not read frame from video file {video_path}")
        return None
    if scale != 1.0:
        frame = cv2.resize(
            frame, None, fx=scale, fy=scale, interpolation=cv2.INTER_LINEAR
        )
    return frame

def detect_humans_with_yolo(frame, yolo_model, confidence_threshold=YOLO_CONFIDENCE):
    """
    Detect humans in a frame using YOLO model.
    Parameters:
    - frame: Input frame (BGR format)
    - yolo_model: Loaded YOLO model
    - confidence_threshold: Detection confidence threshold
    Returns:
    - human_boxes: List of bounding boxes for detected humans
    """
    # Run YOLO detection
    results = yolo_model(frame, conf=confidence_threshold, verbose=False)
    human_boxes = []
    # Process results
    for result in results:
        boxes = result.boxes
        if boxes is not None:
            for box in boxes:
                # Get class ID
                cls = int(box.cls.cpu().numpy()[0])
                # Check if it's a person (class 0 in COCO)
                if cls == HUMAN_CLASS_ID:
                    # Get bounding box coordinates (x1, y1, x2, y2)
                    coords = box.xyxy[0].cpu().numpy()
                    conf = float(box.conf.cpu().numpy()[0])
                    human_boxes.append({
                        'bbox': coords,
                        'confidence': conf
                    })
                    logger.info(f"Detected human with confidence {conf:.2f} at {coords}")
    return human_boxes

def add_yolo_detections_to_predictor(predictor, inference_state, human_detections, frame_width):
    """
    Add YOLO human detections as bounding boxes to SAM2 predictor.
    For stereo videos, creates two objects (left and right humans).
    Parameters:
    - predictor: SAM2 video predictor
    - inference_state: SAM2 inference state
    - human_detections: List of human detection results
    - frame_width: Width of the frame for stereo splitting
    Returns:
    - out_mask_logits: SAM2 output mask logits
    """
    half_frame_width = frame_width // 2
    # Sort detections by x-coordinate to get left and right humans
    human_detections.sort(key=lambda x: x['bbox'][0])  # Sort by x1 coordinate
    obj_id = 1
    out_mask_logits = None
    for i, detection in enumerate(human_detections[:2]):  # Take up to 2 humans (left and right)
        bbox = detection['bbox']
        # For stereo videos, assign obj_id based on position
        if len(human_detections) >= 2:
            # If we have multiple humans, assign based on left/right position
            center_x = (bbox[0] + bbox[2]) / 2
            if center_x < half_frame_width:
                current_obj_id = 1  # Left human
            else:
                current_obj_id = 2  # Right human
        else:
            # If only one human, duplicate for both sides (as in original stereo logic)
            current_obj_id = obj_id
            obj_id += 1
            # Also add the mirrored version for stereo
            if obj_id <= 2:
                mirrored_bbox = bbox.copy()
                mirrored_bbox[0] += half_frame_width  # Shift x1
                mirrored_bbox[2] += half_frame_width  # Shift x2
                # Ensure mirrored bbox is within frame bounds
                mirrored_bbox[0] = max(0, min(mirrored_bbox[0], frame_width - 1))
                mirrored_bbox[2] = max(0, min(mirrored_bbox[2], frame_width - 1))
                try:
                    _, out_obj_ids, out_mask_logits = predictor.add_new_points_or_box(
                        inference_state=inference_state,
                        frame_idx=0,
                        obj_id=obj_id,
                        box=mirrored_bbox.astype(np.float32),
                    )
                    logger.info(f"Added mirrored human detection for Object {obj_id}")
                    obj_id += 1
                except Exception as e:
                    logger.error(f"Error adding mirrored human detection for Object {obj_id}: {e}")
        try:
            _, out_obj_ids, out_mask_logits = predictor.add_new_points_or_box(
                inference_state=inference_state,
                frame_idx=0,
                obj_id=current_obj_id,
                box=bbox.astype(np.float32),
            )
            logger.info(f"Added human detection for Object {current_obj_id}")
        except Exception as e:
            logger.error(f"Error adding human detection for Object {current_obj_id}: {e}")
    return out_mask_logits

def propagate_masks(predictor, inference_state):
    video_segments = {}
    for out_frame_idx, out_obj_ids, out_mask_logits in predictor.propagate_in_video(inference_state):
        video_segments[out_frame_idx] = {
            out_obj_id: (out_mask_logits[i] > 0.0).cpu().numpy()
            for i, out_obj_id in enumerate(out_obj_ids)
        }
    return video_segments


def apply_colored_mask(frame, masks_a, masks_b):
    colored_mask = np.zeros_like(frame)
    # Apply colors to the masks
    for mask in masks_a:
        mask = mask.squeeze()
        if mask.shape != frame.shape[:2]:
            mask = cv2.resize(mask, (frame.shape[1], frame.shape[0]), interpolation=cv2.INTER_NEAREST)
        indices = np.where(mask)
        colored_mask[mask] = [0, 255, 0]  # Green for Object A
    for mask in masks_b:
        mask = mask.squeeze()
        if mask.shape != frame.shape[:2]:
            mask = cv2.resize(mask, (frame.shape[1], frame.shape[0]), interpolation=cv2.INTER_NEAREST)
        indices = np.where(mask)
        colored_mask[mask] = [255, 0, 0]  # Blue for Object B
    return colored_mask

def process_and_save_output_video(video_path, output_video_path, video_segments, use_nvenc=False):
    """
    Process high-resolution frames, apply upscaled masks, and save the output video.
    """
    cap = cv2.VideoCapture(video_path)
    frame_width = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))
    frame_height = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))
    fps = cap.get(cv2.CAP_PROP_FPS) or 59.94
    # Setup VideoWriter with desired settings
    if use_nvenc:
        # Use FFmpeg with NVENC offloading for H.265 encoding
        import subprocess
        if sys.platform == 'darwin':
            encoder = 'hevc_videotoolbox'
        else:
            encoder = 'hevc_nvenc'
        command = [
            'ffmpeg',
            '-y',  # Overwrite output file if it exists
            '-f', 'rawvideo',
            '-vcodec', 'rawvideo',
            '-pix_fmt', 'bgr24',
            '-s', f'{frame_width}x{frame_height}',
            '-r', str(fps),
            '-i', '-',  # Input from stdin
            '-an',  # No audio
            '-vcodec', encoder,
            '-pix_fmt', 'nv12',
            '-preset', 'slow',
            '-b:v', '50M',
            output_video_path
        ]
        process = subprocess.Popen(command, stdin=subprocess.PIPE)
    else:
        # Use OpenCV VideoWriter
        fourcc = cv2.VideoWriter_fourcc(*'HEVC')  # H.265
        out = cv2.VideoWriter(output_video_path, fourcc, fps, (frame_width, frame_height))
    frame_idx = 0
    while True:
        ret, frame = cap.read()
        if not ret or frame_idx >= len(video_segments):
            break
        masks = [video_segments[frame_idx][out_obj_id] for out_obj_id in video_segments[frame_idx]]
        upscaled_masks = []
        for mask in masks:
            mask = mask.squeeze()
            upscaled_mask = cv2.resize(mask.astype(np.uint8), (frame.shape[1], frame.shape[0]), interpolation=cv2.INTER_NEAREST)
            upscaled_masks.append(upscaled_mask)
        result_frame = apply_green_mask(frame, upscaled_masks)
        # Write frame to output
        if use_nvenc:
            process.stdin.write(result_frame.tobytes())
        else:
            out.write(result_frame)
        frame_idx += 1
    cap.release()
    if use_nvenc:
        process.stdin.close()
        process.wait()
    else:
        out.release()


def get_video_file_name(index):
    return f"segment_{str(index).zfill(3)}.mp4"

def do_yolo_detection_on_segments(base_dir, segments, detect_segments, scale=1.0, yolo_model_path=YOLO_MODEL_PATH):
    """
    Run YOLO detection on specified segments and save detection results.
    """
    logger.info("Running YOLO detection on requested segments.")
    # Load YOLO model
    yolo_model = YOLO(yolo_model_path)
    for i, segment in enumerate(segments):
        segment_index = int(segment.split("_")[1])
        segment_dir = os.path.join(base_dir, segment)
        detection_file = os.path.join(segment_dir, "yolo_detections")
        video_file = os.path.join(segment_dir, get_video_file_name(i))
        if segment_index in detect_segments and not os.path.exists(detection_file):
            first_frame = load_first_frame(video_file, scale)
            if first_frame is None:
                continue
            # Convert BGR to RGB for YOLO (YOLO expects BGR, so keep as BGR)
            human_detections = detect_humans_with_yolo(first_frame, yolo_model)
            if human_detections:
                # Save detection results
                with open(detection_file, 'w') as f:
                    f.write("# YOLO Human Detections\n")
                    for detection in human_detections:
                        bbox = detection['bbox']
                        conf = detection['confidence']
                        f.write(f"{bbox[0]},{bbox[1]},{bbox[2]},{bbox[3]},{conf}\n")
                logger.info(f"Saved {len(human_detections)} human detections for segment {segment}")
            else:
                logger.warning(f"No humans detected in segment {segment}")
                # Create empty file to mark as processed
                with open(detection_file, 'w') as f:
                    f.write("# No humans detected\n")


def save_final_masks(video_segments, mask_output_path):
    """
    Save the final masks as a colored image.
    """
    last_frame_idx = max(video_segments.keys())
    masks_dict = video_segments[last_frame_idx]
    # Assuming you have two objects with IDs 1 and 2
    mask_a = masks_dict.get(1).squeeze() if 1 in masks_dict else None
    mask_b = masks_dict.get(2).squeeze() if 2 in masks_dict else None
    if mask_a is None and mask_b is None:
        logger.error("No masks found for objects.")
        return
    # Use the first available mask to determine dimensions
    reference_mask = mask_a if mask_a is not None else mask_b
    black_frame = np.zeros((reference_mask.shape[0], reference_mask.shape[1], 3), dtype=np.uint8)
    if mask_a is not None:
        mask_a = mask_a.astype(bool)
        black_frame[mask_a] = GREEN
    if mask_b is not None:
        mask_b = mask_b.astype(bool)
        black_frame[mask_b] = BLUE
    # Save the mask image
    cv2.imwrite(mask_output_path, black_frame)
    logger.info(f"Saved final masks to {mask_output_path}")


def create_low_res_video(input_video_path, output_video_path, scale):
    """
    Creates a low-resolution version of the input video for inference.
    """
    cap = cv2.VideoCapture(input_video_path)
    frame_width = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH) * scale)
    frame_height = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT) * scale)
    fps = cap.get(cv2.CAP_PROP_FPS) or 59.94
    fourcc = cv2.VideoWriter_fourcc(*'mp4v')
    out = cv2.VideoWriter(output_video_path, fourcc, fps, (frame_width, frame_height))
    while True:
        ret, frame = cap.read()
        if not ret:
            break
        low_res_frame = cv2.resize(frame, (frame_width, frame_height), interpolation=cv2.INTER_LINEAR)
        out.write(low_res_frame)
    cap.release()
    out.release()

def main():
    parser = argparse.ArgumentParser(description="Process video segments with YOLO + SAM2.")
    parser.add_argument("--base-dir", type=str, help="Base directory for video segments.")
    parser.add_argument("--segments-detect-humans", nargs='*', help="Segments for which to run YOLO human detection. Use 'all' for all segments, or list specific segment numbers (e.g., 1 5 10). Default: all segments.")
    parser.add_argument("--yolo-model", type=str, default=YOLO_MODEL_PATH, help="Path to YOLO model.")
    parser.add_argument("--yolo-confidence", type=float, default=YOLO_CONFIDENCE, help="YOLO detection confidence threshold.")
    args = parser.parse_args()
    base_dir = args.base_dir
    segments = [d for d in os.listdir(base_dir) if os.path.isdir(os.path.join(base_dir, d)) and d.startswith("segment_")]
    segments.sort(key=lambda x: int(x.split("_")[1]))
    # Handle different ways to specify segments for YOLO detection
    if args.segments_detect_humans is None or len(args.segments_detect_humans) == 0:
        # Default: run YOLO on all segments
        detect_segments = [int(seg.split("_")[1]) for seg in segments]
        logger.info("No segments specified, running YOLO detection on ALL segments")
    elif len(args.segments_detect_humans) == 1 and args.segments_detect_humans[0].lower() == 'all':
        # Explicit 'all' keyword
        detect_segments = [int(seg.split("_")[1]) for seg in segments]
        logger.info("Running YOLO detection on ALL segments")
    else:
        # Specific segment numbers provided
        try:
            detect_segments = [int(x) for x in args.segments_detect_humans]
            logger.info(f"Running YOLO detection on segments: {detect_segments}")
        except ValueError:
            logger.error("Invalid segment numbers provided. Use integers or 'all'.")
            return
    # Run YOLO detection on specified segments
    do_yolo_detection_on_segments(base_dir, segments, detect_segments, scale=INFERENCE_SCALE, yolo_model_path=args.yolo_model)
    # Load YOLO model for inference
    yolo_model = YOLO(args.yolo_model)
    for i, segment in enumerate(segments):
        segment_index = int(segment.split("_")[1])
        segment_dir = os.path.join(base_dir, segment)
        video_file_name = get_video_file_name(i)
        video_path = os.path.join(segment_dir, video_file_name)
        output_done_file = os.path.join(segment_dir, "output_frames_done")
        if os.path.exists(output_done_file):
            logger.info(f"Segment {segment} already processed. Skipping.")
            continue
        logger.info(f"Processing segment {segment}")
        # Initialize predictor
        predictor = initialize_predictor()
        # Prepare low-resolution video frames for inference
        low_res_video_path = os.path.join(segment_dir, "low_res_video.mp4")
        if not os.path.exists(low_res_video_path):
            create_low_res_video(video_path, low_res_video_path, INFERENCE_SCALE)
            logger.info(f"Low-resolution video created for segment {segment}")
        else:
            logger.info(f"Low-resolution video already exists for segment {segment}, reuse")
        # Initialize inference state with low-resolution video
        inference_state = predictor.init_state(video_path=low_res_video_path, async_loading_frames=True)
        # Load YOLO detections or previous masks
        detection_file = os.path.join(segment_dir, "yolo_detections")
        use_detections = segment_index in detect_segments
        if i == 0 and not use_detections:
            # First segment must use YOLO detection since there's no previous mask
            logger.warning(f"First segment {segment} requires YOLO detection. Running YOLO detection.")
            use_detections = True
        if i > 0 and not use_detections:
            # Try to load previous segment mask - search backwards for the most recent successful mask
            logger.info(f"Using previous segment mask for segment {segment}")
            mask_found = False
            # Search backwards through previous segments to find a valid mask
            for j in range(i - 1, -1, -1):
                prev_segment_dir = os.path.join(base_dir, segments[j])
                prev_mask_path = os.path.join(prev_segment_dir, "mask.png")
                if os.path.exists(prev_mask_path):
                    try:
                        per_obj_input_mask, input_palette = load_previous_segment_mask(prev_segment_dir)
                        # Add previous masks to predictor
                        for obj_id, mask in per_obj_input_mask.items():
                            predictor.add_new_mask(inference_state, 0, obj_id, mask)
                        logger.info(f"Successfully loaded mask from segment {segments[j]}")
                        mask_found = True
                        break
                    except Exception as e:
                        logger.warning(f"Error loading mask from {segments[j]}: {e}")
                        continue
            if not mask_found:
                logger.error(f"No valid previous mask found for segment {segment}. Consider running YOLO detection on this segment.")
                continue
        else:
            # Load first frame for detection
            first_frame = load_first_frame(low_res_video_path, scale=1.0)
            if first_frame is None:
                logger.error(f"Could not load first frame for segment {segment}")
                continue
            # Run YOLO detection on first frame (either from file or on-the-fly)
            if os.path.exists(detection_file):
                logger.info(f"Using existing YOLO detections for segment {segment}")
            else:
                logger.info(f"Running YOLO detection on-the-fly for segment {segment}")
            human_detections = detect_humans_with_yolo(first_frame, yolo_model, args.yolo_confidence)
            if human_detections:
                # Add YOLO detections to predictor
                frame_width = first_frame.shape[1]
                add_yolo_detections_to_predictor(predictor, inference_state, human_detections, frame_width)
            else:
                logger.warning(f"No humans detected in segment {segment}")
                continue
        # Perform inference and collect masks per frame
        video_segments = propagate_masks(predictor, inference_state)
        # Process high-resolution frames and save output video
        output_video_path = os.path.join(segment_dir, f"output_{segment_index}.mp4")
        logger.info("Processing segment complete, attempting to save full video from low res masks")
        process_and_save_output_video(
            video_path,
            output_video_path,
            video_segments,
            use_nvenc=True  # Set to True to use NVENC offloading
        )
        # Save final masks
        mask_output_path = os.path.join(segment_dir, "mask.png")
        save_final_masks(video_segments, mask_output_path)
        # Clean up
        predictor.reset_state(inference_state)
        del inference_state
        del video_segments
        del predictor
        gc.collect()
        try:
            os.remove(low_res_video_path)
            logger.info(f"Deleted low-resolution video for segment {segment}")
        except Exception as e:
            logger.warning(f"Could not delete low-resolution video for segment {segment}: {e}")
        # Mark segment as completed
        open(output_done_file, 'a').close()
    logger.info("Processing complete.")


if __name__ == "__main__":
    main()
```