Andrew Zisserman

144 papers · 2006–2026 · 12 conferences · across top CS/AI conferences

Achievements

🧭 Keyword Pioneer 🗺️ Taxonomy Completionist (25) 🌉 Interdisciplinary Bridge 🌈 Renaissance Researcher (5) 🐣 Hot Topic Early Bird 🌟 Keyword Trendsetter Combo (23) 🏠 Conference Loyalist (28) 📛 The Namer 🌱 Topic Pioneer 🏆 Keyword Champion (2) 🧬 Topic Evolution 🤝 Dynamic Duo (21) 👑 Triple Crown 👥 Mega-Team (27) 🏆 Grand Slam 🔬 Deep Specialist (23) ❓ The Questioner (2) 🔥 Unstoppable (13) 🗃️ Keyword Collector (80) ⚡ Prolific Year (13) 💎 Century Club (142) 📈 Trend Setter 🚀 Conference Pioneer

Conferences

CVPR (52) NIPS (28) ICCV (24) ECCV (17) INTERSPEECH (9) ICLR (5) ICML (3) MICCAI (2) AAAI (1) ACL (1) MIDL (1) WACV (1)

Top co-authors

Weidi Xie (21) Joao Carreira (17) Andrea Vedaldi (15) Arsha Nagrani (14) Triantafyllos Afouras (13) Carl Doersch (12) Tengda Han (11) Gül Varol (11) Joon Son Chung (10) Ankush Gupta (9)

Research topics

Models (1) Core AI (1)

Keywords

video understanding (22) self-supervised learning (15) action recognition (14) multimodal learning (11) contrastive learning (10) representation learning (9) object detection (9) optical flow (9) convolutional neural network (9) video representation (8) zero-shot learning (7) transformer architecture (7) weakly supervised learning (7) semantic segmentation (6) depth estimation (5) transfer learning (5) video segmentation (5) image segmentation (5) cross-modal learning (5) feature extraction (4)

Papers

Segment, Embed, and Align: A Universal Recipe for Aligning Subtitles to Signing ACL 2026
Open-World Object Counting in Videos AAAI 2026
Understanding Co-speech Gestures in-the-wild ICCV 2025
LayerLock: Non-collapsing Representation Learning with Progressive Freezing ICCV 2025
From Panels to Prose: Generating Literary Narratives from Comics ICCV 2025
Shot-by-Shot: Film-Grammar-Aware Training-Free Audio Description Generation ICCV 2025
Lost in Translation, Found in Context: Sign Language Translation with Contextual Cues CVPR 2025
SciVid: Cross-Domain Evaluation of Video Models in Scientific Applications ICCV 2025
Learning from Streaming Video with Orthogonal Gradients CVPR 2025
Amodal Ground Truth and Completion in the Wild CVPR 2024
TIM: A Time Interval Machine for Audio-Visual Action Recognition CVPR 2024
Appearance-based Refinement for Object-Centric Motion Segmentation ECCV 2024
Separating the "Chirp" from the "Chat": Self-supervised Visual Grounding of Sound and Language CVPR 2024
AutoAD III: The Prequel - Back to the Pixels CVPR 2024
Learning from One Continuous Video Stream CVPR 2024
A General Protocol to Probe Large Vision Models for 3D Physical Understanding NIPS 2024
CountGD: Multi-Modal Open-World Counting NIPS 2024
FlexCap: Describe Anything in Images in Controllable Detail NIPS 2024
TAPVid-3D: A Benchmark for Tracking Any Point in 3D NIPS 2024
A Simple Recipe for Contrastively Pre-training Video-First Encoders Beyond 16 Frames CVPR 2024
The Manga Whisperer: Automatically Generating Transcriptions for Comics CVPR 2024
Speech Recognition Models are Strong Lip-readers INTERSPEECH 2024
3D Spine Shape Estimation from Single 2D DXA MICCAI 2024
Automated Spinal MRI Labelling from Reports Using a Large Language Model MICCAI 2024
N2F2: Hierarchical Scene Understanding with Nested Neural Feature Fields ECCV 2024
Made to Order: Discovering monotonic temporal changes via self-supervised video ordering ECCV 2024
Text-Conditioned Resampler For Long Form Video Understanding ECCV 2024
The Change You Want To See WACV 2023
AutoAD: Movie Description in Context CVPR 2023
WhisperX: Time-Accurate Speech Transcription of Long-Form Audio INTERSPEECH 2023
Multi-Modal Classifiers for Open-Vocabulary Object Detection ICML 2023
A Light Touch Approach to Teaching Transformers Multi-View Geometry CVPR 2023
Contrastive Lift: 3D Object Instance Segmentation by Slow-Fast Contrastive Fusion NIPS 2023
No Representation Rules Them All in Category Discovery NIPS 2023
Perception Test: A Diagnostic Benchmark for Multimodal Video Models NIPS 2023
TAPIR: Tracking Any Point with Per-Frame Initialization and Temporal Refinement ICCV 2023
Helping Hands: An Object-Aware Ego-Centric Video Recognition Model ICCV 2023
The Making and Breaking of Camouflage ICCV 2023
Verbs in Action: Improving Verb Understanding in Video-Language Models ICCV 2023
AutoAD II: The Sequel - Who, When, and What in Movie Audio Description ICCV 2023
Vision-Language Modelling For Radiological Imaging and Reports In The Low Data Regime MIDL 2023
Automatic Dense Annotation of Large-Vocabulary Sign Language Videos ECCV 2022
Object Discovery and Representation Networks ECCV 2022
Associating Objects and Their Effects in Video through Coordination Games NIPS 2022
Flamingo: a Visual Language Model for Few-Shot Learning NIPS 2022
TAP-Vid: A Benchmark for Tracking Any Point in a Video NIPS 2022
Input-Level Inductive Biases for 3D Reconstruction CVPR 2022
Sub-Word Level Lip Reading With Visual Attention CVPR 2022
Temporal Alignment Networks for Long-Term Video CVPR 2022
Generalized Category Discovery CVPR 2022
It's About Time: Analog Clock Reading in the Wild CVPR 2022
Reading To Listen at the Cocktail Party: Multi-Modal Speech Separation CVPR 2022
Label, Verify, Correct: A Simple Few Shot Object Detection Method CVPR 2022
Open-Set Recognition: A Good Closed-Set Classifier is All You Need ICLR 2022
Perceiver IO: A General Architecture for Structured Inputs & Outputs ICLR 2022
Segmenting Moving Objects via an Object-Centric Layered Representation NIPS 2022
TeachText: CrossModal Generalized Distillation for Text-Video Retrieval ICCV 2021
With a Little Help From My Friends: Nearest-Neighbor Contrastive Learning of Visual Representations ICCV 2021
Aligning Subtitles in Sign Language Videos ICCV 2021
Broaden Your Views for Self-Supervised Video Learning ICCV 2021
Frozen in Time: A Joint Video and Image Encoder for End-to-End Retrieval ICCV 2021
Perceiver: General Perception with Iterative Attention ICML 2021
Localizing Visual Sounds the Hard Way CVPR 2021
Thinking Fast and Slow: Efficient Text-to-Visual Retrieval With Transformers CVPR 2021
Temporal Query Networks for Fine-Grained Video Understanding CVPR 2021
Co-Attention for Conditioned Image Matching CVPR 2021
Omnimatte: Associating Objects and Their Effects in Video CVPR 2021
Read and Attend: Temporal Localisation in Sign Language Videos CVPR 2021
Self-Supervised Video Object Segmentation by Motion Grouping ICCV 2021
Self-Supervised Learning of Audio-Visual Objects from Video ECCV 2020
Self-Supervised MultiModal Versatile Networks NIPS 2020
Self-supervised Co-Training for Video Representation Learning NIPS 2020
CrossTransformers: spatially-aware few-shot transfer NIPS 2020
Visual Grounding in Video for Unsupervised Word Translation CVPR 2020
Speech2Action: Cross-Modal Supervision for Action Recognition CVPR 2020
End-to-End Learning of Visual Representations From Uncurated Instructional Videos CVPR 2020
Counting Out Time: Class Agnostic Video Repetition Counting in the Wild CVPR 2020
Memory-augmented Dense Predictive Coding for Video Representation Learning ECCV 2020
Smooth-AP: Smoothing the Path Towards Large-Scale Image Retrieval ECCV 2020
BSL-1K: Scaling up co-articulated sign language recognition using mouthing cues ECCV 2020
Amplifying Key Cues for Human-Object-Interaction Detection ECCV 2020
Adaptive Text Recognition through Visual Matching ECCV 2020
Automatically Discovering and Learning New Visual Categories with Ranking Statistics ICLR 2020
Training Neural Networks for and by Interpolation ICML 2020
Spot the Conversation: Speaker Diarisation in the Wild INTERSPEECH 2020
Now You’re Speaking My Language: Visual Language Identification INTERSPEECH 2020
Deep Frank-Wolfe For Neural Network Optimization ICLR 2019
Exploiting Temporal Context for 3D Human Pose Estimation in the Wild CVPR 2019
Sim2real transfer learning for 3D human pose estimation: motion to the rescue NIPS 2019
EPIC-Fusion: Audio-Visual Temporal Binding for Egocentric Action Recognition ICCV 2019
Controllable Attention for Structured Layered Video Decomposition ICCV 2019
Learning to Discover Novel Visual Categories via Deep Transfer Clustering ICCV 2019
Unsupervised Learning of Object Keypoints for Perception and Control NIPS 2019
LAEO-Net: Revisiting People Looking at Each Other in Videos CVPR 2019
My Lips Are Concealed: Audio-Visual Speech Enhancement Through Obstructions INTERSPEECH 2019
Temporal Cycle-Consistency Learning CVPR 2019
Video Action Transformer Network CVPR 2019
The Visual Centrifuge: Model-Free Layered Video Representations CVPR 2019
Learning to Navigate in Cities Without a Map NIPS 2018
VoxCeleb2: Deep Speaker Recognition INTERSPEECH 2018
Smooth Loss Functions for Deep Top-k Classification ICLR 2018
Seeing Voices and Hearing Faces: Cross-Modal Biometric Matching CVPR 2018
Learning and Using the Arrow of Time CVPR 2018
What Have We Learned From Deep Representations for Action Recognition? CVPR 2018
Massively Parallel Video Networks ECCV 2018
Comparator Networks ECCV 2018
Learnable PINs: Cross-Modal Embeddings for Person Identity ECCV 2018
Objects that Sound ECCV 2018
X2Face: A network for controlling face generation using images, audio, and pose codes ECCV 2018
Deep Lip Reading: A Comparison of Models and an Online Application INTERSPEECH 2018
The Conversation: Deep Audio-Visual Speech Enhancement INTERSPEECH 2018
Multi-Task Self-Supervised Visual Learning ICCV 2017
Lip Reading Sentences in the Wild CVPR 2017
Look, Listen and Learn ICCV 2017
Detect to Track and Track to Detect ICCV 2017
Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset CVPR 2017
VoxCeleb: A Large-Scale Speaker Identification Dataset INTERSPEECH 2017
Synthetic Data for Text Localisation in Natural Images CVPR 2016
Personalizing Human Video Pose Estimation CVPR 2016
3D Shape Attributes CVPR 2016
Convolutional Two-Stream Network Fusion for Video Action Recognition CVPR 2016
Flowing ConvNets for Human Pose Estimation in Videos ICCV 2015
Spatial Transformer Networks NIPS 2015
Talking Heads: Detecting Humans and Recognizing Their Interactions CVPR 2014
Seeing the Arrow of Time CVPR 2014
Immediate, Scalable Object Category Detection CVPR 2014
Triangulation Embedding and Democratic Aggregation for Image Search CVPR 2014
Two-Stream Convolutional Networks for Action Recognition in Videos NIPS 2014
A Compact and Discriminative Face Track Descriptor CVPR 2014
Blocks That Shout: Distinctive Parts for Scene Classification CVPR 2013
All About VLAD CVPR 2013
Learning to Detect Partially Overlapping Instances CVPR 2013
Discriminative Sub-categorization CVPR 2013
Deep Fisher Networks for Large-Scale Image Classification NIPS 2013
Symbiotic Segmentation and Part Localization for Fine-Grained Categorization ICCV 2013
Human Pose Estimation Using a Joint Pixel-wise and Part-wise Formulation CVPR 2013
Pylon Model for Semantic Segmentation NIPS 2011
Learning To Count Objects in Images NIPS 2010
Simultaneous Object Detection and Ranking with Weak Supervision NIPS 2010
Structured output regression for detection with partial truncation NIPS 2009
Segmenting Scenes by Matching Image Composites NIPS 2009
Supervised Dictionary Learning NIPS 2008
Learning Visual Attributes NIPS 2007
Bayesian Image Super-resolution, Continued NIPS 2006