Arsha Nagrani

40 papers · 2017–2025 · 9 top CS/AI conferences

Achievements

🌍 Conference Polyglot (9) 🧭 Keyword Pioneer 🌉 Interdisciplinary Bridge 🌈 Renaissance Researcher (5) 🏃 Academic Marathon (8)

🐝 Cross-Pollinator (10) 🗺️ Taxonomy Completionist (69) 👥 Mega-Team (43) 🔬 Deep Specialist (17) 🤝 Dynamic Duo (20) 🧬 Topic Evolution 🚀 Conference Pioneer 🗃️ Keyword Collector (170) 📈 Trend Setter 💎 Century Club (40) 🔥 Unstoppable (9) ❓ The Questioner ⚡ Prolific Year (7)

Conferences

CVPR (15) ICCV (8) INTERSPEECH (5) ECCV (4) NIPS (3) ACL (2) EMNLP (1) IJCNLP (1) WACV (1)

Top co-authors

Cordelia Schmid (20) Andrew Zisserman (14) Anurag Arnab (10) Chen Sun (8) Paul Hongsuck Seo (6) Max Bain (5) Weidi Xie (5) Shyamal Buch (5) Gül Varol (5) Tengda Han (4)

Research topics

Core AI (1)

Keywords

multimodal learning (16) video understanding (12) video captioning (5) contrastive learning (4) temporal localization (4) large language model (4) visual question answering (3) audio description (3) vision-language model (3) video question answering (3) self-supervised learning (3) automatic speech recognition (3) speaker verification (3) cross-modal learning (3) video-language model (3) dense video captioning (3) efficient computing (2) transformer architecture (2) zero-shot learning (2) semantic alignment (2)

Papers

Flexible Frame Selection for Efficient Video Reasoning · CVPR 2025
MINERVA: Evaluating Complex Video Reasoning · ICCV 2025
Shot-by-Shot: Film-Grammar-Aware Training-Free Audio Description Generation · ICCV 2025
Unbiasing through Textual Descriptions: Mitigating Representation Bias in Video Benchmarks · CVPR 2025
VIEWS: Entity-Aware News Video Captioning · EMNLP 2024
Mixture of Nested Experts: Adaptive Processing of Visual Tokens · NIPS 2024
Streaming Dense Video Captioning · CVPR 2024
On Scaling Up a Multilingual Vision and Language Model · CVPR 2024
MoReVQA: Exploring Modular Reasoning Models for Video Question Answering · CVPR 2024
AutoAD III: The Prequel - Back to the Pixels · CVPR 2024
VicTR: Video-conditioned Text Representations for Activity Recognition · CVPR 2024
Vid2Seq: Large-Scale Pretraining of a Visual Language Model for Dense Video Captioning · CVPR 2023
AutoAD: Movie Description in Context · CVPR 2023
UnLoc: A Unified Framework for Video Localization Tasks · ICCV 2023
LanSER: Language-Model Supported Speech Emotion Recognition · INTERSPEECH 2023
AutoAD II: The Sequel - Who, When, and What in Movie Audio Description · ICCV 2023
VidChapters-7M: Video Chapters at Scale · NIPS 2023
AVFormer: Injecting Vision Into Frozen Speech Models for Zero-Shot AV-ASR · CVPR 2023
Modular Visual Question Answering via Code Generation · ACL 2023
Verbs in Action: Improving Verb Understanding in Video-Language Models · ICCV 2023
AVATAR: Unconstrained Audiovisual Speech Recognition · INTERSPEECH 2022
Masking Modalities for Cross-Modal Video Retrieval · WACV 2022
Learning Audio-Video Modalities from Image Captions · ECCV 2022
TL;DW? Summarizing Instructional Videos with Task Relevance & Cross-Modal Saliency · ECCV 2022
End-to-End Generative Pretraining for Multimodal Video Captioning · CVPR 2022
Attention Bottlenecks for Multimodal Fusion · NIPS 2021
Recognizing Multimodal Entailment · ACL 2021
Localizing Visual Sounds the Hard Way · CVPR 2021
Look Before You Speak: Visually Contextualized Utterances · CVPR 2021
Composable Augmentation Encoding for Video Representation Learning · ICCV 2021
Frozen in Time: A Joint Video and Image Encoder for End-to-End Retrieval · ICCV 2021
Recognizing Multimodal Entailment · IJCNLP 2021
Uncertainty-Aware Weakly Supervised Action Detection from Untrimmed Videos · ECCV 2020
Spot the Conversation: Speaker Diarisation in the Wild · INTERSPEECH 2020
Speech2Action: Cross-Modal Supervision for Action Recognition · CVPR 2020
EPIC-Fusion: Audio-Visual Temporal Binding for Egocentric Action Recognition · ICCV 2019
VoxCeleb2: Deep Speaker Recognition · INTERSPEECH 2018
Learnable PINs: Cross-Modal Embeddings for Person Identity · ECCV 2018
Seeing Voices and Hearing Faces: Cross-Modal Biometric Matching · CVPR 2018
VoxCeleb: A Large-Scale Speaker Identification Dataset · INTERSPEECH 2017