Arsha Nagrani
40 papers · 2017–2025 · 9 top CS/AI conferences
Achievements
Conference Polyglot (9)
Keyword Pioneer
Interdisciplinary Bridge
Renaissance Researcher (5)
Academic Marathon (8)
Cross-Pollinator (10)
Taxonomy Completionist (69)
Mega-Team (43)
Deep Specialist (17)
Dynamic Duo (20)
Topic Evolution
Conference Pioneer
Keyword Collector (170)
Trend Setter
Century Club (40)
Unstoppable (9)
The Questioner
Prolific Year (7)
Conferences
CVPR (15)
ICCV (8)
INTERSPEECH (5)
ECCV (4)
NeurIPS (3)
ACL (2)
EMNLP (1)
IJCNLP (1)
WACV (1)
Research topics
Keywords
multimodal learning (16)
video understanding (12)
video captioning (5)
contrastive learning (4)
temporal localization (4)
large language model (4)
visual question answering (3)
audio description (3)
vision-language model (3)
video question answering (3)
self-supervised learning (3)
automatic speech recognition (3)
speaker verification (3)
cross-modal learning (3)
video-language model (3)
dense video captioning (3)
efficient computing (2)
transformer architecture (2)
zero-shot learning (2)
semantic alignment (2)
Papers
Flexible Frame Selection for Efficient Video Reasoning
CVPR 2025
MINERVA: Evaluating Complex Video Reasoning
ICCV 2025
Shot-by-Shot: Film-Grammar-Aware Training-Free Audio Description Generation
ICCV 2025
Unbiasing through Textual Descriptions: Mitigating Representation Bias in Video Benchmarks
CVPR 2025
VIEWS: Entity-Aware News Video Captioning
EMNLP 2024
Mixture of Nested Experts: Adaptive Processing of Visual Tokens
NeurIPS 2024
Streaming Dense Video Captioning
CVPR 2024
On Scaling Up a Multilingual Vision and Language Model
CVPR 2024
MoReVQA: Exploring Modular Reasoning Models for Video Question Answering
CVPR 2024
AutoAD III: The Prequel - Back to the Pixels
CVPR 2024
VicTR: Video-conditioned Text Representations for Activity Recognition
CVPR 2024
Vid2Seq: Large-Scale Pretraining of a Visual Language Model for Dense Video Captioning
CVPR 2023
AutoAD: Movie Description in Context
CVPR 2023
UnLoc: A Unified Framework for Video Localization Tasks
ICCV 2023
LanSER: Language-Model Supported Speech Emotion Recognition
INTERSPEECH 2023
AutoAD II: The Sequel - Who, When, and What in Movie Audio Description
ICCV 2023
VidChapters-7M: Video Chapters at Scale
NeurIPS 2023
AVFormer: Injecting Vision Into Frozen Speech Models for Zero-Shot AV-ASR
CVPR 2023
Modular Visual Question Answering via Code Generation
ACL 2023
Verbs in Action: Improving Verb Understanding in Video-Language Models
ICCV 2023
AVATAR: Unconstrained Audiovisual Speech Recognition
INTERSPEECH 2022
Masking Modalities for Cross-Modal Video Retrieval
WACV 2022
Learning Audio-Video Modalities from Image Captions
ECCV 2022
TL;DW? Summarizing Instructional Videos with Task Relevance & Cross-Modal Saliency
ECCV 2022
End-to-End Generative Pretraining for Multimodal Video Captioning
CVPR 2022
Attention Bottlenecks for Multimodal Fusion
NeurIPS 2021
Recognizing Multimodal Entailment
ACL 2021
Localizing Visual Sounds the Hard Way
CVPR 2021
Look Before You Speak: Visually Contextualized Utterances
CVPR 2021
Composable Augmentation Encoding for Video Representation Learning
ICCV 2021
Frozen in Time: A Joint Video and Image Encoder for End-to-End Retrieval
ICCV 2021
Recognizing Multimodal Entailment
IJCNLP 2021
Uncertainty-Aware Weakly Supervised Action Detection from Untrimmed Videos
ECCV 2020
Spot the Conversation: Speaker Diarisation in the Wild
INTERSPEECH 2020
Speech2Action: Cross-Modal Supervision for Action Recognition
CVPR 2020
EPIC-Fusion: Audio-Visual Temporal Binding for Egocentric Action Recognition
ICCV 2019
VoxCeleb2: Deep Speaker Recognition
INTERSPEECH 2018
Learnable PINs: Cross-Modal Embeddings for Person Identity
ECCV 2018
Seeing Voices and Hearing Faces: Cross-Modal Biometric Matching
CVPR 2018
VoxCeleb: A Large-Scale Speaker Identification Dataset
INTERSPEECH 2017