optimizations A round 1

2025-07-26 11:04:04 -07:00
parent 40ae537f7a
commit b642b562f0
4 changed files with 353 additions and 13 deletions
--- a/spec.md
+++ b/spec.md
@@ -123,6 +123,204 @@ hardware:
 3. **Performance Profiling**: Detailed resource usage analytics
 4. **Quality Validation**: Comprehensive testing suite

+## Post-Implementation Optimization Opportunities
+
+*Based on first successful 30-second test clip execution results (A40 GPU, 50% scale, 9x200 frame chunks)*
+
+### Performance Analysis Findings
+- **Processing Speed**: ~0.54s per frame (64.4s for 120 frames per chunk)
+- **VRAM Utilization**: Only 2.5% (1.11GB of 45GB available) - significantly underutilized
+- **RAM Usage**: 106GB used of 494GB available (21.5%)
+- **Primary Bottleneck**: Intermediate ffmpeg encoding operations per chunk
+
+### Identified Optimization Categories
+
+#### Category A: Performance Improvements (Quick Wins)
+1. **Audio Track Preservation** ⚠️ **CRITICAL**
+   - Issue: Output video missing audio track from input
+   - Solution: Use ffmpeg to copy audio stream during final video creation
+   - Implementation: Add `-c:a copy` to final ffmpeg command
+   - Impact: Essential for production usability
+   - Risk: Low, standard ffmpeg operation
+
+2. **Frame Count Synchronization** ⚠️ **CRITICAL**
+   - Issue: Audio sync drift if input/output frame counts differ
+   - Solution: Validate exact frame count preservation throughout pipeline
+   - Implementation: Frame count verification + duration matching
+   - Impact: Prevents audio desync in long videos
+   - Risk: Low, validation feature
+
+3. **Memory Usage Reality Check** ⚠️ **IMPORTANT**
+   - Current assumption: Unlimited RAM for memory-only pipeline
+   - Reality: RunPod container limited to ~48GB RAM
+   - Risk calculation: 1-hour video at 60 fps = ~216k frames, far more than ~48GB can hold if every processed frame stays in memory
+   - Solution: Implement streaming output instead of full in-memory accumulation
+   - Impact: Enables processing of long-form content
+   - Risk: Medium, requires pipeline restructuring
+
+4. **Larger Chunk Sizes**
+   - Current: 200 frames per chunk (conservative for 10GB RTX 3080)
+   - Opportunity: 600-800 frames per chunk on high-VRAM systems
+   - Impact: Reduce 9 chunks to 2-3 chunks, fewer intermediate operations
+   - Risk: Low, easily configurable
+
+5. **Streaming Output Pipeline**
+   - Current: Accumulate all processed frames in memory, write once
+   - Opportunity: Write processed chunks to temporary segments, merge at end
+   - Impact: Constant memory usage regardless of video length
+   - Risk: Medium, requires temporary file management
+
+6. **Enhanced Performance Profiling**
+   - Current: Basic memory monitoring
+   - Opportunity: Detailed timing per processing stage (detection, propagation, encoding)
+   - Impact: Identify exact bottlenecks for targeted optimization
+   - Risk: Low, debugging feature
+
+7. **Parallel Eye Processing**
+   - Current: Sequential left eye → right eye processing
+   - Opportunity: Process both eyes simultaneously
+   - Impact: Processing time potentially cut by up to ~50%, better GPU utilization
+   - Risk: Medium, memory management complexity
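The per-stage timing described under "Enhanced Performance Profiling" could be collected with a small context manager. This is a minimal sketch, not the project's profiler: `StageProfiler`, the stage names, and the `time.sleep` calls standing in for real work are all hypothetical.

```python
import time
from collections import defaultdict
from contextlib import contextmanager


class StageProfiler:
    """Accumulates wall-clock time per named pipeline stage (hypothetical helper)."""

    def __init__(self):
        self.totals = defaultdict(float)  # stage name -> total seconds
        self.calls = defaultdict(int)     # stage name -> number of timed blocks

    @contextmanager
    def stage(self, name):
        start = time.perf_counter()
        try:
            yield
        finally:
            # Record the elapsed time even if the stage raised.
            self.totals[name] += time.perf_counter() - start
            self.calls[name] += 1

    def report(self):
        """Return (name, total_seconds, share_of_total) rows, slowest first."""
        grand_total = sum(self.totals.values()) or 1.0
        rows = [(n, t, t / grand_total) for n, t in self.totals.items()]
        return sorted(rows, key=lambda row: row[1], reverse=True)


profiler = StageProfiler()
for _chunk in range(3):
    with profiler.stage("detection"):
        time.sleep(0.001)  # stand-in for the real detection call
    with profiler.stage("encoding"):
        time.sleep(0.003)  # stand-in for the real ffmpeg encode

for name, seconds, share in profiler.report():
    print(f"{name}: {seconds:.4f}s ({share:.0%})")
```

Wrapping detection, propagation, and encoding this way would show whether the intermediate ffmpeg encodes really dominate a chunk's runtime.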
+
+#### Category B: Stereo Consistency Fixes (Critical for VR)
+1. **Master-Slave Eye Processing**
+   - Issue: Independent detection leads to mismatched person counts between eyes
+   - Solution: Use left eye detections as "seeds" for right eye processing
+   - Impact: Ensures identical person detection across stereo pair
+   - Risk: Low, maintains current quality while improving consistency
+
+2. **Cross-Eye Detection Validation**
+   - Issue: Hair/clothing included on one eye but not the other
+   - Solution: Compare detection results, flag inconsistencies for reprocessing
+   - Impact: 90%+ stereo alignment improvement
+   - Risk: Low, fallback to current behavior
+
+3. **Disparity-Aware Segmentation**
+   - Issue: Segmentation boundaries differ between eyes despite same person
+   - Solution: Use stereo disparity to correlate features between eyes
+   - Impact: True stereo-consistent matting
+   - Risk: High, complex implementation
+
+4. **Joint Stereo Detection**
+   - Issue: YOLO runs independently on each eye
+   - Solution: Run YOLO on full SBS frame, split detections spatially
+   - Impact: Guaranteed identical detection counts
+   - Risk: Medium, requires detection coordinate mapping
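The coordinate mapping that "Joint Stereo Detection" calls for might look like the sketch below. It assumes detections arrive as `(x1, y1, x2, y2)` pixel boxes on the full side-by-side frame and that the left eye occupies the left half; `split_sbs_detections` and `counts_match` are illustrative names, not part of the existing pipeline.

```python
def split_sbs_detections(boxes, frame_width):
    """Split boxes detected on a full SBS frame into per-eye lists.

    boxes: iterable of (x1, y1, x2, y2) in full-frame pixel coordinates.
    A box goes to the eye that contains its horizontal center. Boxes are
    clipped at the seam, and right-eye boxes are shifted so that x is
    relative to the right half.
    """
    half = frame_width / 2
    left, right = [], []
    for x1, y1, x2, y2 in boxes:
        center_x = (x1 + x2) / 2
        if center_x < half:
            left.append((x1, y1, min(x2, half), y2))
        else:
            right.append((max(x1, half) - half, y1, x2 - half, y2))
    return left, right


def counts_match(left, right):
    """True when both eyes ended up with the same number of detections."""
    return len(left) == len(right)


# One person seen in both halves of a 3072-pixel-wide SBS frame.
left_boxes, right_boxes = split_sbs_detections(
    [(100, 50, 300, 400), (1636, 50, 1836, 400)], frame_width=3072
)
print(left_boxes, right_boxes, counts_match(left_boxes, right_boxes))
```

The count check is kept as a safeguard, since a single detector pass could still miss a person in one half of the frame.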
+
+#### Category C: Advanced Optimizations (Future)
+1. **Adaptive Memory Management**
+   - Opportunity: Dynamic chunk sizing based on real-time VRAM usage
+   - Impact: Optimal resource utilization across different hardware
+   - Risk: Medium, complex heuristics
+
+2. **Multi-Resolution Processing**
+   - Opportunity: Initial processing at lower resolution, edge refinement at full
+   - Impact: Speed improvement while maintaining quality
+   - Risk: Medium, quality validation required
+
+3. **Enhanced Workflow Documentation**
+   - Issue: Unclear intermediate data lifecycle
+   - Solution: Detailed logging of chunk processing, optional intermediate preservation
+   - Impact: Better debugging and user understanding
+   - Risk: Low, documentation feature
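Dynamic chunk sizing, as proposed under "Adaptive Memory Management", can be approximated by dividing a VRAM budget by a measured per-frame cost. This is a sketch under stated assumptions: the per-frame VRAM figure must be measured on the target system, and `pick_chunk_size`, `headroom`, and the clamp limits are hypothetical.

```python
def pick_chunk_size(free_vram_gb, vram_per_frame_gb,
                    min_frames=50, max_frames=800, headroom=0.8):
    """Largest chunk that fits in `headroom` (a fraction) of the free VRAM.

    The result is clamped to [min_frames, max_frames] so that a noisy
    measurement cannot produce a degenerate chunk size.
    """
    if vram_per_frame_gb <= 0:
        raise ValueError("vram_per_frame_gb must be positive")
    budget_gb = free_vram_gb * headroom
    frames = int(budget_gb / vram_per_frame_gb)
    return max(min_frames, min(max_frames, frames))


# Plenty of VRAM: the upper clamp applies.
print(pick_chunk_size(free_vram_gb=45, vram_per_frame_gb=0.001))
# Very little VRAM: the lower clamp applies.
print(pick_chunk_size(free_vram_gb=0.1, vram_per_frame_gb=1.0))
```

Querying free VRAM before each chunk and feeding it into a function like this is one way to keep a single configuration usable on both a 10GB and a 45GB card.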
+
+### Implementation Strategy
+- **Phase A**: Quick performance wins (larger chunks, profiling)
+- **Phase B**: Stereo consistency (master-slave, validation)  
+- **Phase C**: Advanced features (disparity-aware, memory optimization)
+
+### Configuration Extensions Required
+```yaml
+processing:
+  chunk_size: 600  # Increase from 200 for high-VRAM systems
+  memory_pipeline: false  # Skip intermediate video creation (disabled due to RAM limits)
+  streaming_output: true  # Write chunks progressively instead of accumulating
+  parallel_eyes: false  # Process eyes simultaneously
+  max_memory_gb: 40  # Realistic RAM limit for RunPod containers
+  
+audio:
+  preserve_audio: true  # Copy audio track from input to output
+  verify_sync: true  # Validate frame count and duration matching
+  audio_codec: "copy"  # Preserve original audio codec
+  
+stereo:
+  consistency_mode: "master_slave"  # "independent", "master_slave", "joint"
+  validation_threshold: 0.8  # Similarity threshold between eyes
+  correction_method: "transfer"  # "transfer", "reprocess", "ensemble"
+  
+performance:
+  profile_enabled: true  # Detailed timing analysis
+  preserve_intermediates: false  # For debugging workflow
+
+debugging:
+  log_intermediate_workflow: true  # Document chunk lifecycle
+  save_detection_visualization: false  # Debug detection mismatches
+  frame_count_validation: true  # Ensure exact frame preservation
+```
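A loader could reject invalid values for these keys before any processing starts. The sketch below works on an already-parsed dict and checks only what the YAML comments state (the allowed strings) plus two assumptions of its own: that `chunk_size` must be a positive integer and that `validation_threshold` lies between 0 and 1. `validate_config` is a hypothetical helper.

```python
ALLOWED_VALUES = {
    ("stereo", "consistency_mode"): {"independent", "master_slave", "joint"},
    ("stereo", "correction_method"): {"transfer", "reprocess", "ensemble"},
}


def validate_config(cfg):
    """Return a list of problems; an empty list means no issue was found."""
    problems = []
    for (section, key), allowed in ALLOWED_VALUES.items():
        value = cfg.get(section, {}).get(key)
        if value is not None and value not in allowed:
            problems.append(f"{section}.{key}: {value!r} not in {sorted(allowed)}")

    chunk_size = cfg.get("processing", {}).get("chunk_size")
    if chunk_size is not None and (not isinstance(chunk_size, int) or chunk_size <= 0):
        problems.append("processing.chunk_size must be a positive integer")

    # Assumption: the similarity threshold is a fraction in [0, 1].
    threshold = cfg.get("stereo", {}).get("validation_threshold")
    if threshold is not None and not 0.0 <= threshold <= 1.0:
        problems.append("stereo.validation_threshold must be between 0 and 1")
    return problems


good = {
    "processing": {"chunk_size": 600},
    "stereo": {"consistency_mode": "master_slave",
               "validation_threshold": 0.8,
               "correction_method": "transfer"},
}
bad = {
    "processing": {"chunk_size": 0},
    "stereo": {"consistency_mode": "both", "validation_threshold": 1.5},
}
print(validate_config(good))
print(len(validate_config(bad)))
```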
+
+### Technical Implementation Details
+
+#### Audio Preservation Implementation
+```python
+# During final video save, include audio stream copy
+ffmpeg_cmd = [
+    'ffmpeg', '-y',
+    '-framerate', str(fps),
+    '-i', frame_pattern,           # Video frames
+    '-i', input_video_path,        # Original video for audio
+    '-c:v', 'h264_nvenc',         # GPU video codec; swap in 'libx264' when NVENC is unavailable
+    '-c:a', 'copy',               # Copy audio without re-encoding
+    '-map', '0:v:0',              # Map video from first input
+    '-map', '1:a:0?',             # Map audio from second input ('?' tolerates inputs without audio)
+    '-shortest',                  # Match shortest stream duration
+    output_path
+]
+```
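The `verify_sync` check could compare frame counts and durations once both files have been probed. This sketch assumes the counts and durations are already available, however they are obtained (for example from ffprobe); `check_sync` and its tolerance rule are hypothetical.

```python
def check_sync(in_frames, out_frames, in_duration, out_duration, fps,
               tolerance_frames=0):
    """Compare input and output frame counts and durations.

    Durations may differ by half a frame of jitter plus any tolerated
    frame difference before the pair is reported as out of sync.
    """
    frame_diff = out_frames - in_frames
    duration_diff = out_duration - in_duration
    max_duration_diff = (tolerance_frames + 0.5) / fps
    in_sync = (abs(frame_diff) <= tolerance_frames
               and abs(duration_diff) <= max_duration_diff)
    return {
        "frame_diff": frame_diff,
        "duration_diff_s": duration_diff,
        "in_sync": in_sync,
    }


# The 30-second test clip: 9 chunks of 200 frames at 60 fps.
print(check_sync(1800, 1800, 30.0, 30.0, fps=60.0))
# Five dropped frames would be flagged.
print(check_sync(1800, 1795, 30.0, 29.92, fps=60.0))
```

Running the check after the final mux, rather than per chunk, catches drift introduced by `-shortest` trimming as well as by dropped frames.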
+
+#### Streaming Output Implementation
+```python
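+# Instead of accumulating frames in memory:
+# (write_video_segment() and merge_segments_with_audio() are helpers not shown here)
+class StreamingVideoWriter:
+    def __init__(self, output_path, fps, audio_source):
+        self.output_path = output_path    # Final merged video
+        self.fps = fps                    # Frame rate used for every segment
+        self.audio_source = audio_source  # Original video that supplies the audio track
+        self.temp_segments = []
+        self.current_segment = 0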
+        
+    def write_chunk(self, processed_frames):
+        # Write chunk to temporary segment
+        segment_path = f"temp_segment_{self.current_segment}.mp4"
+        self.write_video_segment(processed_frames, segment_path)
+        self.temp_segments.append(segment_path)
+        self.current_segment += 1
+        
+    def finalize(self):
+        # Merge all segments with audio preservation
+        self.merge_segments_with_audio()
+```
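One possible shape for the merge step is ffmpeg's concat demuxer, which reads a list file naming the segments in order. The sketch below only builds the list text and the command line and does not run ffmpeg. `concat_list_text` and `merge_command` are hypothetical helpers, and stream-copying the segments assumes they were all encoded with identical settings.

```python
def concat_list_text(segment_paths):
    """Text of a concat-demuxer list file: one `file '...'` line per segment."""
    lines = []
    for path in segment_paths:
        escaped = path.replace("'", "'\\''")  # escape single quotes inside the quoted path
        lines.append(f"file '{escaped}'")
    return "\n".join(lines) + "\n"


def merge_command(list_path, audio_source, output_path):
    """ffmpeg argument list that joins the segments and copies the original audio."""
    return [
        "ffmpeg", "-y",
        "-f", "concat", "-safe", "0", "-i", list_path,  # segments, in list order
        "-i", audio_source,                             # original video, for audio
        "-map", "0:v:0",
        "-map", "1:a:0?",   # '?' makes the audio mapping optional
        "-c", "copy",       # no re-encode of video or audio
        "-shortest",
        output_path,
    ]


segments = ["temp_segment_0.mp4", "temp_segment_1.mp4"]
print(concat_list_text(segments), end="")
print(" ".join(merge_command("segments.txt", "input.mp4", "output.mp4")))
```

Because the merge is a stream copy, its cost is mostly disk I/O, so memory use stays flat regardless of how many segments were written.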
+
+#### Memory Usage Calculation
+```python
+def estimate_memory_requirements(duration_seconds, fps, resolution_scale=0.5):
+    """Calculate memory usage for different video lengths"""
+    frames = duration_seconds * fps
+    
+    # Per-frame memory (rough estimate for VR180 at 50% scale, 3 channels x 4 bytes)
+    frame_size_mb = (3072 * 1536 * 3 * 4) / (1024 * 1024)  # 54MB per frame
+    
+    total_memory_gb = (frames * frame_size_mb) / 1024
+    
+    return {
+        'duration': duration_seconds,
+        'total_frames': frames,
+        'estimated_memory_gb': total_memory_gb,
+        'safe_for_48gb': total_memory_gb < 40
+    }
+
+# Example outputs at 60 fps:
+# 30 seconds: ~95GB (exceeds the 40GB budget)
+# 5 minutes: ~949GB (requires streaming)
+# 1 hour: ~11,391GB (requires streaming)
+```
+
 ## Success Criteria

 ### Technical Feasibility