Cordelia Schmid

151 papers · 2013–2025 · 13 top CS/AI conferences

Achievements

🗺️ Taxonomy Completionist (15) 🧭 Keyword Pioneer 🌉 Interdisciplinary Bridge 🌈 Renaissance Researcher (7) 🌍 Conference Polyglot (13)

🌟 Keyword Trendsetter Combo (4) 🏠 Conference Loyalist (43) 🧬 Topic Evolution 🤝 Dynamic Duo (30) 🏆 Grand Slam 👑 Triple Crown 🌱 Topic Pioneer 🔬 Deep Specialist (29) 🏆 Keyword Champion 🗃️ Keyword Collector (541) 💎 Century Club (151) ❓ The Questioner (5) 📈 Trend Setter 🚀 Conference Pioneer ⚡ Prolific Year (10) 🔥 Unstoppable (13)

Conferences

CVPR (53) ICCV (43) NIPS (18) ECCV (17) CORL (4) ICML (4) ACL (3) ICLR (3) WACV (2) AAAI (1) EMNLP (1) INTERSPEECH (1) NAACL (1)

Top co-authors

Ivan Laptev (30) Chen Sun (29) Arsha Nagrani (20) Anurag Arnab (18) Karteek Alahari (13) Ahmet Iscen (11) Shizhe Chen (10) Alireza Fathi (10) Philippe Weinzaepfel (9) Jean Ponce (9)

Research topics

Core AI (1)

Keywords

video understanding (24) multimodal learning (17) action recognition (16) self-supervised learning (12) optical flow (11) zero-shot learning (9) object detection (8) vision-language model (8) convolutional neural network (8) 3d reconstruction (6) representation learning (6) human pose estimation (6) large language model (5) image classification (5) video captioning (5) video classification (5) video question answering (5) feature extraction (4) weakly supervised learning (4) attention mechanism (4)

Papers

Large-scale Pre-training for Grounded Video Caption Generation ICCV 2025
InteractVLM: 3D Interaction Reasoning from 2D Foundational Models CVPR 2025
FirePlace: Geometric Refinements of LLM Common Sense Reasoning for 3D Object Placement CVPR 2025
Flexible Frame Selection for Efficient Video Reasoning CVPR 2025
Chapter-Llama: Efficient Chaptering in Hour-Long Videos with LLMs CVPR 2025
Dense Video Object Captioning from Disjoint Supervision ICLR 2025
Visual Lexicon: Rich Image Features in Language Space CVPR 2025
MINERVA: Evaluating Complex Video Reasoning ICCV 2025
HORT: Monocular Hand-held Objects Reconstruction with Transformers ICCV 2025
mOSCAR: A Large-scale Multilingual and Multimodal Document-level Corpus ACL 2025
Language-Guided Image Tokenization for Generation CVPR 2025
Towards Zero-Shot Multimodal Machine Translation NAACL 2025
OVFact: Measuring and Improving Open-Vocabulary Factuality for Long Caption Models EMNLP 2025
Retrieval-Enhanced Contrastive Vision-Text Models ICLR 2024
End-to-End Spatio-Temporal Action Localisation with Video Transformers CVPR 2024
Dense Optical Tracking: Connecting the Dots CVPR 2024
A Generative Approach for Wikipedia-Scale Visual Entity Recognition CVPR 2024
Pixel-Aligned Language Model CVPR 2024
SUGAR: Pre-training 3D Visual Representations for Robotics CVPR 2024
Time-, Memory- and Parameter-Efficient Visual Adaptation CVPR 2024
MoReVQA: Exploring Modular Reasoning Models for Video Question Answering CVPR 2024
Learning Correlation Structures for Vision Transformers CVPR 2024
Streaming Dense Video Captioning CVPR 2024
DataDream: Few-shot Guided Dataset Generation ECCV 2024
Location-Aware Self-Supervised Transformers for Semantic Segmentation WACV 2024
SceneCraft: An LLM Agent for Synthesizing 3D Scenes as Blender Code ICML 2024
Smoke and Mirrors in Causal Downstream Tasks NIPS 2024
Web-Scale Visual Entity Recognition: An LLM-Driven Data Approach NIPS 2024
CoVR: Learning Composed Video Retrieval from Web Video Captions AAAI 2024
Waffling Around for Performance: Visual Classification with Random Words and Broad Concepts ICCV 2023
AVIS: Autonomous Visual Information Seeking with Large Language Model Agent NIPS 2023
Does Visual Pretraining Help End-to-End Reasoning? NIPS 2023
VidChapters-7M: Video Chapters at Scale NIPS 2023
PolarNet: 3D Point Clouds for Language-Guided Robotic Manipulation CORL 2023
Tackling Ambiguity with Images: Improved Multimodal Machine Translation and Contrastive Evaluation ACL 2023
Modular Visual Question Answering via Code Generation ACL 2023
Vid2Seq: Large-Scale Pretraining of a Visual Language Model for Dense Video Captioning CVPR 2023
Improving Image Recognition by Retrieving From Web-Scale Image-Text Data CVPR 2023
Bridging the Gap Between Model Explanations in Partially Annotated Multi-Label Classification CVPR 2023
REVEAL: Retrieval-Augmented Visual-Language Pre-Training With Multi-Source Multimodal Knowledge Memory CVPR 2023
AVFormer: Injecting Vision Into Frozen Speech Models for Zero-Shot AV-ASR CVPR 2023
How Can Objects Help Action Recognition? CVPR 2023
gSDF: Geometry-Driven Signed Distance Functions for 3D Hand-Object Reconstruction CVPR 2023
Verbs in Action: Improving Verb Understanding in Video-Language Models ICCV 2023
WALDO: Future Video Synthesis Using Object Layer Decomposition and Parametric Flow Prediction ICCV 2023
UnLoc: A Unified Framework for Video Localization Tasks ICCV 2023
Audiovisual Masked Autoencoders ICCV 2023
Instruction-driven history-aware policies for robotic manipulations CORL 2022
TL;DW? Summarizing Instructional Videos with Task Relevance & Cross-Modal Saliency ECCV 2022
Learning from Unlabeled 3D Environments for Vision-and-Language Navigation ECCV 2022
End-to-End Generative Pretraining for Multimodal Video Captioning CVPR 2022
Multiview Transformers for Video Recognition CVPR 2022
Masking Modalities for Cross-Modal Video Retrieval WACV 2022
AVATAR: Unconstrained Audiovisual Speech Recognition INTERSPEECH 2022
Learning With Neighbor Consistency for Noisy Labels CVPR 2022
Think Global, Act Local: Dual-Scale Graph Transformer for Vision-and-Language Navigation CVPR 2022
TubeDETR: Spatio-Temporal Video Grounding With Transformers CVPR 2022
AlignSDF: Pose-Aligned Signed Distance Fields for Hand-Object Reconstruction ECCV 2022
Learning Audio-Video Modalities from Image Captions ECCV 2022
Zero-Shot Video Question Answering via Frozen Bidirectional Language Models NIPS 2022
Language Conditioned Spatial Relation Reasoning for 3D Object Grounding NIPS 2022
Composable Augmentation Encoding for Video Representation Learning ICCV 2021
HDMapGen: A Hierarchical Graph Generative Model of High Definition Maps CVPR 2021
Look Before You Speak: Visually Contextualized Utterances CVPR 2021
Goal-Conditioned Reinforcement Learning with Imagined Subgoals ICML 2021
History Aware Multimodal Transformer for Vision-and-Language Navigation NIPS 2021
CCVS: Context-aware Controllable Video Synthesis NIPS 2021
Attention Bottlenecks for Multimodal Fusion NIPS 2021
Airbert: In-Domain Pretraining for Vision-and-Language Navigation ICCV 2021
Unified Graph Structured Models for Video Understanding ICCV 2021
Improving Robustness Against Common Corruptions With Frequency Biased Models ICCV 2021
Learning Temporal Dynamics From Cycles in Narrated Video ICCV 2021
Segmenter: Transformer for Semantic Segmentation ICCV 2021
ViViT: A Video Vision Transformer ICCV 2021
Episodic Transformer for Vision-and-Language Navigation ICCV 2021
Just Ask: Learning To Answer Questions From Millions of Narrated Videos ICCV 2021
Large-Scale Unsupervised Object Discovery NIPS 2021
Differentiable rendering with perturbed optimizers NIPS 2021
VectorNet: Encoding HD Maps and Agent Dynamics From Vectorized Representation CVPR 2020
Uncertainty-Aware Weakly Supervised Action Detection from Untrimmed Videos ECCV 2020
Selecting Relevant Features from a Multi-domain Representation for Few-shot Classification ECCV 2020
What Makes for Good Views for Contrastive Learning? NIPS 2020
Memory-Efficient Incremental Learning Through Feature Adaptation ECCV 2020
Speech2Action: Cross-Modal Supervision for Action Recognition CVPR 2020
Multi-modal Transformer for Video Retrieval ECCV 2020
TAO: A Large-Scale Benchmark for Tracking Any Object ECCV 2020
Consistency Guided Scene Flow Estimation ECCV 2020
Graph convolutional networks for learning with few clean and many noisy labels ECCV 2020
Radioactive data: tracing through training ICML 2020
TNT: Target-driven Trajectory Prediction CORL 2020
Learning Obstacle Representations for Neural Motion Planning CORL 2020
Leveraging Photometric Consistency Over Time for Sparsely Supervised Hand-Object Reconstruction CVPR 2020
White-box vs Black-box: Bayes Optimal Strategies for Membership Inference ICML 2019
Spreading vectors for similarity search ICLR 2019
Adaptive Density Estimation for Generative Models NIPS 2019
Detecting Unseen Visual Relations Using Analogies ICCV 2019
Moulding Humans: Non-Parametric 3D Human Shape Estimation From Single Images ICCV 2019
Diversity With Cooperation: Ensemble Methods for Few-Shot Classification ICCV 2019
Self-Supervised Learning With Geometric Constraints in Monocular Video: Connecting Flow, Depth, and Camera ICCV 2019
VideoBERT: A Joint Model for Video and Language Representation Learning ICCV 2019
Relational Action Forecasting CVPR 2019
MARS: Motion-Augmented RGB Stream for Action Recognition CVPR 2019
A Structured Model for Action Detection CVPR 2019
Learning Joint Reconstruction of Hands and Manipulated Objects CVPR 2019
End-to-End Incremental Learning ECCV 2018
BodyNet: Volumetric Inference of 3D Human Body Shapes ECCV 2018
A flexible model for training action localization with varying levels of supervision NIPS 2018
Unsupervised Learning of Artistic Styles with Archetypal Style Analysis NIPS 2018
Modeling Visual Context is Key to Augmenting Object Detection Datasets ECCV 2018
AVA: A Video Dataset of Spatio-Temporally Localized Atomic Visual Actions CVPR 2018
PoTion: Pose MoTion Representation for Action Recognition CVPR 2018
Actor and Observer: Joint Modeling of First and Third-Person Videos CVPR 2018
How good is my GAN? ECCV 2018
Actor-centric Relation Network ECCV 2018
Joint Learning of Object and Action Detectors ICCV 2017
Action Tubelet Detector for Spatio-Temporal Action Localization ICCV 2017
Learning Video Object Segmentation With Visual Memory ICCV 2017
Learning From Synthetic Humans CVPR 2017
Weakly-Supervised Learning of Visual Relations ICCV 2017
Learning Motion Patterns in Videos CVPR 2017
LCR-Net: Localization-Classification-Regression for Human Pose CVPR 2017
Areas of Attention for Image Captioning ICCV 2017
SCNet: Learning Semantic Correspondence ICCV 2017
Incremental Learning of Object Detectors Without Catastrophic Forgetting ICCV 2017
BlitzNet: A Real-Time Deep Network for Scene Understanding ICCV 2017
MoCap-guided Data Augmentation for 3D Pose Estimation in the Wild NIPS 2016
Proposal Flow CVPR 2016
Weakly-Supervised Alignment of Video With Text ICCV 2015
Learning to Detect Motion Boundaries CVPR 2015
Online Object Tracking With Proposal Selection ICCV 2015
Learning to Track for Spatio-Temporal Action Localization ICCV 2015
Unsupervised Object Discovery and Tracking in Video Collections ICCV 2015
Local Convolutional Features With Unsupervised Training for Image Retrieval ICCV 2015
P-CNN: Pose-Based CNN Features for Action Recognition ICCV 2015
Unsupervised Object Discovery and Localization in the Wild: Part-Based Matching With Bottom-Up Region Proposals CVPR 2015
EpicFlow: Edge-Preserving Interpolation of Correspondences for Optical Flow CVPR 2015
Convolutional Kernel Networks NIPS 2014
Mixing Body-Part Sequences for Human Pose Estimation CVPR 2014
Multi-fold MIL Training for Weakly Supervised Object Localization CVPR 2014
Efficient Action Localization with Approximately Normalized Fisher Vectors CVPR 2014
Transformation Pursuit for Image Classification CVPR 2014
Label-Embedding for Attribute-Based Classification CVPR 2013
Expanded Parts Model for Human Attribute and Action Recognition in Still Images CVPR 2013
Event Retrieval in Large Video Collections with Circulant Temporal Encoding CVPR 2013
Stable Hyper-pooling and Query Expansion for Event Detection ICCV 2013
DeepFlow: Large Displacement Optical Flow with Deep Matching ICCV 2013
Towards Understanding Action Recognition ICCV 2013
Segmentation Driven Object Detection with Fisher Vectors ICCV 2013
Estimating Human Pose with Flowing Puppets ICCV 2013
Action and Event Recognition with Fisher Vectors on a Compact Feature Set ICCV 2013
Action Recognition with Improved Trajectories ICCV 2013