Video Analytics System for Natural Language-Driven Surveillance

Problem: Traditional surveillance systems require manual rule-setting and adapt poorly to dynamic environments such as ports or industrial sites. They often fail in challenging scenarios such as occlusions, aerial views, or ambiguous human-object interactions.

Solution & Architecture: This system enables natural language-driven video analysis using a pipeline of:

System Architecture:

Experiments & Observations: The team evaluated multiple VideoLLMs across scenarios, with and without blur or highlighting applied, which revealed hallucination risks and gaps in temporal reasoning. Prompt engineering, object-aware attention, and subtask chaining significantly improved the relevance of the models' answers.
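
Illustration: The subtask chaining mentioned above can be sketched as a sequence of prompts in which each step receives the previous step's answer. This is only a minimal sketch under stated assumptions, not the team's implementation: the `VideoLLM` callable signature, the `Subtask` type, the `run_chain` helper, the prompt wording, and the clip identifier are all hypothetical, and `echo_model` is a stand-in so the example runs without any real model.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical interface: (prompt, clip_id) -> text answer from a VideoLLM.
VideoLLM = Callable[[str, str], str]


@dataclass
class Subtask:
    name: str
    template: str  # may reference {query} and {previous}


def run_chain(model: VideoLLM, clip_id: str, query: str,
              subtasks: List[Subtask]) -> Dict[str, str]:
    """Run subtasks in order, feeding each answer into the next prompt."""
    previous = ""
    answers: Dict[str, str] = {}
    for task in subtasks:
        prompt = task.template.format(query=query, previous=previous)
        previous = model(prompt, clip_id)
        answers[task.name] = previous
    return answers


# Assumed decomposition: find objects, relate them, then answer the query.
SUBTASKS = [
    Subtask("objects", "List the people and vehicles visible in the clip."),
    Subtask("relations",
            "Given these objects: {previous}\n"
            "Describe how they are positioned relative to each other."),
    Subtask("answer",
            "Given these observations: {previous}\n"
            "Answer the operator's question: {query}"),
]


def echo_model(prompt: str, clip_id: str) -> str:
    """Stand-in model: returns the last line of the prompt, tagged with the clip."""
    return f"[{clip_id}] {prompt.splitlines()[-1]}"


if __name__ == "__main__":
    result = run_chain(echo_model, "clip-001",
                       "Is any worker too close to a moving container?",
                       SUBTASKS)
    for name, answer in result.items():
        print(f"{name}: {answer}")
```

Splitting the query this way keeps each prompt narrow, which is one plausible reason chaining helped against the hallucination and temporal-reasoning gaps noted above; a real deployment would replace `echo_model` with a call to the chosen VideoLLM.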

Impact: The proposed architecture supports near-real-time analytics in industrial setups with minimal rule authoring. It offers an extensible framework for querying unstructured video through conversational input, which helps ensure safety (e.g., flagging a worker who is too close to a moving container) and increases automation in surveillance tasks.