Video Understanding

1592 directly classified papers

Papers

A Backpack Full of Skills: Egocentric Video Understanding with Diverse Task Perspectives CVPR 2024

Align Before Adapt: Leveraging Entity-to-Region Alignments for Generalizable Video Action Recognition CVPR 2024

TAPVid-3D: A Benchmark for Tracking Any Point in 3D NIPS 2024

MECD: Unlocking Multi-Event Causal Discovery in Video Reasoning NIPS 2024

TempCompass: Do Video LLMs Really Understand Videos? ACL 2024

Neighbor Relations Matter in Video Scene Detection CVPR 2024

Putting the Object Back into Video Object Segmentation CVPR 2024

Retrieval-Augmented Egocentric Video Captioning CVPR 2024

Active Speaker Detection in Fisheye Meeting Scenes with Scene Spatial Spectrums INTERSPEECH 2024

Learned Scanpaths Aid Blind Panoramic Video Quality Assessment CVPR 2024

Test-Time Zero-Shot Temporal Action Localization CVPR 2024

Towards a new research agenda for multimodal enterprise document understanding: What are we missing? ACL 2024

SEA-VQA: Southeast Asian Cultural Context Dataset For Visual Question Answering ACL 2024

Align and Aggregate: Compositional Reasoning with Video Alignment and Answer Aggregation for Video Question-Answering CVPR 2024

Modular Blind Video Quality Assessment CVPR 2024

IKEA Manuals at Work: 4D Grounding of Assembly Instructions on Internet Videos NIPS 2024

Context-Aware Integration of Language and Visual References for Natural Language Tracking CVPR 2024

N-gram Unsupervised Compoundation and Feature Injection for Better Symbolic Music Understanding AAAI 2024

VVS: Video-to-Video Retrieval with Irrelevant Frame Suppression AAAI 2024

CityPulse: Fine-Grained Assessment of Urban Change with Street View Time Series AAAI 2024

SRTube: Video-Language Pre-Training with Action-Centric Video Tube Features and Semantic Role Labeling CVPR 2024

No More Shortcuts: Realizing the Potential of Temporal Self-Supervision AAAI 2024

Fewer Steps, Better Performance: Efficient Cross-Modal Clip Trimming for Video Moment Retrieval Using Language AAAI 2024

PPLNs: Parametric Piecewise Linear Networks for Event-Based Temporal Modeling and Beyond NIPS 2024

Local-Global Multi-Modal Distillation for Weakly-Supervised Temporal Video Grounding AAAI 2024