Video Understanding

1592 directly classified papers

Papers

Fewer Steps, Better Performance: Efficient Cross-Modal Clip Trimming for Video Moment Retrieval Using Language AAAI 2024

Video Discourse Parsing and Its Application to Multimodal Summarization: A Dataset and Baseline Approaches EMNLP 2024

MMAR: Multilingual and Multimodal Anaphora Resolution in Instructional Videos EMNLP 2024

VIEWS: Entity-Aware News Video Captioning EMNLP 2024

GMMFormer: Gaussian-Mixture-Model Based Transformer for Efficient Partially Relevant Video Retrieval AAAI 2024

DTF-AT: Decoupled Time-Frequency Audio Transformer for Event Classification AAAI 2024

Towards Answering Health-related Questions from Medical Videos: Datasets and Approaches COLING 2024

Multi-View Dynamic Reflection Prior for Video Glass Surface Detection AAAI 2024

DiffusionTrack: Diffusion Model for Multi-Object Tracking AAAI 2024

Video-Based Human Pose Regression via Decoupled Space-Time Aggregation CVPR 2024

TD²-Net: Toward Denoising and Debiasing for Video Scene Graph Generation AAAI 2024

TR-DETR: Task-Reciprocal Transformer for Joint Moment Retrieval and Highlight Detection AAAI 2024

Referred by Multi-Modality: A Unified Temporal Transformer for Video Object Segmentation AAAI 2024

Context Enhanced Transformer for Single Image Object Detection in Video Data AAAI 2024

SyncVSR: Data-Efficient Visual Speech Recognition with End-to-End Crossmodal Audio Token Synchronization INTERSPEECH 2024

MaViLS, a Benchmark Dataset for Video-to-Slide Alignment, Assessing Baseline Accuracy with a Multimodal Alignment Algorithm Leveraging Speech, OCR, and Visual Features INTERSPEECH 2024

Video-Text Prompting for Weakly Supervised Spatio-Temporal Video Grounding EMNLP 2024

A multimodal analysis of different types of laughter expression in conversational dialogues INTERSPEECH 2024

Diving Deep into the Motion Representation of Video-Text Models ACL 2024

TE-TAD: Towards Full End-to-End Temporal Action Detection via Time-Aligned Coordinate Expression CVPR 2024

SAVSR: Arbitrary-Scale Video Super-Resolution via a Learned Scale-Adaptive Network AAAI 2024

Streaming Dense Video Captioning CVPR 2024

Cooperation Does Matter: Exploring Multi-Order Bilateral Relations for Audio-Visual Segmentation CVPR 2024

Collaborative Weakly Supervised Video Correlation Learning for Procedure-Aware Instructional Video Analysis AAAI 2024

Frame2: A FrameNet-based Multimodal Dataset for Tackling Text-image Interactions in Video COLING 2024