Cordelia Schmid
151 papers · 2013–2025 · 13 conferences · across top CS/AI conferences
Achievements
Taxonomy Completionist (15)
Keyword Pioneer
Interdisciplinary Bridge
Renaissance Researcher (7)
Conference Polyglot (13)
Keyword Trendsetter Combo (4)
Conference Loyalist (43)
Topic Evolution
Dynamic Duo (30)
Grand Slam
Triple Crown
Topic Pioneer
Deep Specialist (29)
Keyword Champion
Keyword Collector (541)
Century Club (151)
The Questioner (5)
Trend Setter
Conference Pioneer
Prolific Year (10)
Unstoppable (13)
Conferences
CVPR (53)
ICCV (43)
NIPS (18)
ECCV (17)
CORL (4)
ICML (4)
ACL (3)
ICLR (3)
WACV (2)
AAAI (1)
EMNLP (1)
INTERSPEECH (1)
NAACL (1)
Keywords
video understanding (24)
multimodal learning (17)
action recognition (16)
self-supervised learning (12)
optical flow (11)
zero-shot learning (9)
object detection (8)
vision-language model (8)
convolutional neural network (8)
3d reconstruction (6)
representation learning (6)
human pose estimation (6)
large language model (5)
image classification (5)
video captioning (5)
video classification (5)
video question answering (5)
feature extraction (4)
weakly supervised learning (4)
attention mechanism (4)
Papers
Large-scale Pre-training for Grounded Video Caption Generation
ICCV 2025
InteractVLM: 3D Interaction Reasoning from 2D Foundational Models
CVPR 2025
FirePlace: Geometric Refinements of LLM Common Sense Reasoning for 3D Object Placement
CVPR 2025
Flexible Frame Selection for Efficient Video Reasoning
CVPR 2025
Chapter-Llama: Efficient Chaptering in Hour-Long Videos with LLMs
CVPR 2025
Dense Video Object Captioning from Disjoint Supervision
ICLR 2025
Visual Lexicon: Rich Image Features in Language Space
CVPR 2025
MINERVA: Evaluating Complex Video Reasoning
ICCV 2025
HORT: Monocular Hand-held Objects Reconstruction with Transformers
ICCV 2025
mOSCAR: A Large-scale Multilingual and Multimodal Document-level Corpus
ACL 2025
Language-Guided Image Tokenization for Generation
CVPR 2025
Towards Zero-Shot Multimodal Machine Translation
NAACL 2025
OVFact: Measuring and Improving Open-Vocabulary Factuality for Long Caption Models
EMNLP 2025
Retrieval-Enhanced Contrastive Vision-Text Models
ICLR 2024
End-to-End Spatio-Temporal Action Localisation with Video Transformers
CVPR 2024
Dense Optical Tracking: Connecting the Dots
CVPR 2024
A Generative Approach for Wikipedia-Scale Visual Entity Recognition
CVPR 2024
Pixel-Aligned Language Model
CVPR 2024
SUGAR: Pre-training 3D Visual Representations for Robotics
CVPR 2024
Time-, Memory- and Parameter-Efficient Visual Adaptation
CVPR 2024
MoReVQA: Exploring Modular Reasoning Models for Video Question Answering
CVPR 2024
Learning Correlation Structures for Vision Transformers
CVPR 2024
Streaming Dense Video Captioning
CVPR 2024
DataDream: Few-shot Guided Dataset Generation
ECCV 2024
Location-Aware Self-Supervised Transformers for Semantic Segmentation
WACV 2024
SceneCraft: An LLM Agent for Synthesizing 3D Scenes as Blender Code
ICML 2024
Smoke and Mirrors in Causal Downstream Tasks
NIPS 2024
Web-Scale Visual Entity Recognition: An LLM-Driven Data Approach
NIPS 2024
CoVR: Learning Composed Video Retrieval from Web Video Captions
AAAI 2024
Waffling Around for Performance: Visual Classification with Random Words and Broad Concepts
ICCV 2023
AVIS: Autonomous Visual Information Seeking with Large Language Model Agent
NIPS 2023
Does Visual Pretraining Help End-to-End Reasoning?
NIPS 2023
VidChapters-7M: Video Chapters at Scale
NIPS 2023
PolarNet: 3D Point Clouds for Language-Guided Robotic Manipulation
CORL 2023
Tackling Ambiguity with Images: Improved Multimodal Machine Translation and Contrastive Evaluation
ACL 2023
Modular Visual Question Answering via Code Generation
ACL 2023
Vid2Seq: Large-Scale Pretraining of a Visual Language Model for Dense Video Captioning
CVPR 2023
Improving Image Recognition by Retrieving From Web-Scale Image-Text Data
CVPR 2023
Bridging the Gap Between Model Explanations in Partially Annotated Multi-Label Classification
CVPR 2023
REVEAL: Retrieval-Augmented Visual-Language Pre-Training With Multi-Source Multimodal Knowledge Memory
CVPR 2023
AVFormer: Injecting Vision Into Frozen Speech Models for Zero-Shot AV-ASR
CVPR 2023
How Can Objects Help Action Recognition?
CVPR 2023
gSDF: Geometry-Driven Signed Distance Functions for 3D Hand-Object Reconstruction
CVPR 2023
Verbs in Action: Improving Verb Understanding in Video-Language Models
ICCV 2023
WALDO: Future Video Synthesis Using Object Layer Decomposition and Parametric Flow Prediction
ICCV 2023
UnLoc: A Unified Framework for Video Localization Tasks
ICCV 2023
Audiovisual Masked Autoencoders
ICCV 2023
Instruction-driven history-aware policies for robotic manipulations
CORL 2022
TL;DW? Summarizing Instructional Videos with Task Relevance & Cross-Modal Saliency
ECCV 2022
Learning from Unlabeled 3D Environments for Vision-and-Language Navigation
ECCV 2022
End-to-End Generative Pretraining for Multimodal Video Captioning
CVPR 2022
Multiview Transformers for Video Recognition
CVPR 2022
Masking Modalities for Cross-Modal Video Retrieval
WACV 2022
AVATAR: Unconstrained Audiovisual Speech Recognition
INTERSPEECH 2022
Learning With Neighbor Consistency for Noisy Labels
CVPR 2022
Think Global, Act Local: Dual-Scale Graph Transformer for Vision-and-Language Navigation
CVPR 2022
TubeDETR: Spatio-Temporal Video Grounding With Transformers
CVPR 2022
AlignSDF: Pose-Aligned Signed Distance Fields for Hand-Object Reconstruction
ECCV 2022
Learning Audio-Video Modalities from Image Captions
ECCV 2022
Zero-Shot Video Question Answering via Frozen Bidirectional Language Models
NIPS 2022
Language Conditioned Spatial Relation Reasoning for 3D Object Grounding
NIPS 2022
Composable Augmentation Encoding for Video Representation Learning
ICCV 2021
HDMapGen: A Hierarchical Graph Generative Model of High Definition Maps
CVPR 2021
Look Before You Speak: Visually Contextualized Utterances
CVPR 2021
Goal-Conditioned Reinforcement Learning with Imagined Subgoals
ICML 2021
History Aware Multimodal Transformer for Vision-and-Language Navigation
NIPS 2021
CCVS: Context-aware Controllable Video Synthesis
NIPS 2021
Attention Bottlenecks for Multimodal Fusion
NIPS 2021
Airbert: In-Domain Pretraining for Vision-and-Language Navigation
ICCV 2021
Unified Graph Structured Models for Video Understanding
ICCV 2021
Improving Robustness Against Common Corruptions With Frequency Biased Models
ICCV 2021
Learning Temporal Dynamics From Cycles in Narrated Video
ICCV 2021
Segmenter: Transformer for Semantic Segmentation
ICCV 2021
ViViT: A Video Vision Transformer
ICCV 2021
Episodic Transformer for Vision-and-Language Navigation
ICCV 2021
Just Ask: Learning To Answer Questions From Millions of Narrated Videos
ICCV 2021
Large-Scale Unsupervised Object Discovery
NIPS 2021
Differentiable rendering with perturbed optimizers
NIPS 2021
VectorNet: Encoding HD Maps and Agent Dynamics From Vectorized Representation
CVPR 2020
Uncertainty-Aware Weakly Supervised Action Detection from Untrimmed Videos
ECCV 2020
Selecting Relevant Features from a Multi-domain Representation for Few-shot Classification
ECCV 2020
What Makes for Good Views for Contrastive Learning?
NIPS 2020
Memory-Efficient Incremental Learning Through Feature Adaptation
ECCV 2020
Speech2Action: Cross-Modal Supervision for Action Recognition
CVPR 2020
Multi-modal Transformer for Video Retrieval
ECCV 2020
TAO: A Large-Scale Benchmark for Tracking Any Object
ECCV 2020
Consistency Guided Scene Flow Estimation
ECCV 2020
Graph convolutional networks for learning with few clean and many noisy labels
ECCV 2020
Radioactive data: tracing through training
ICML 2020
TNT: Target-driven Trajectory Prediction
CORL 2020
Learning Obstacle Representations for Neural Motion Planning
CORL 2020
Leveraging Photometric Consistency Over Time for Sparsely Supervised Hand-Object Reconstruction
CVPR 2020
White-box vs Black-box: Bayes Optimal Strategies for Membership Inference
ICML 2019
Spreading vectors for similarity search
ICLR 2019
Adaptive Density Estimation for Generative Models
NIPS 2019
Detecting Unseen Visual Relations Using Analogies
ICCV 2019
Moulding Humans: Non-Parametric 3D Human Shape Estimation From Single Images
ICCV 2019
Diversity With Cooperation: Ensemble Methods for Few-Shot Classification
ICCV 2019
Self-Supervised Learning With Geometric Constraints in Monocular Video: Connecting Flow, Depth, and Camera
ICCV 2019
VideoBERT: A Joint Model for Video and Language Representation Learning
ICCV 2019
Relational Action Forecasting
CVPR 2019
MARS: Motion-Augmented RGB Stream for Action Recognition
CVPR 2019
A Structured Model for Action Detection
CVPR 2019
Learning Joint Reconstruction of Hands and Manipulated Objects
CVPR 2019
End-to-End Incremental Learning
ECCV 2018
BodyNet: Volumetric Inference of 3D Human Body Shapes
ECCV 2018
A flexible model for training action localization with varying levels of supervision
NIPS 2018
Unsupervised Learning of Artistic Styles with Archetypal Style Analysis
NIPS 2018
Modeling Visual Context is Key to Augmenting Object Detection Datasets
ECCV 2018
AVA: A Video Dataset of Spatio-Temporally Localized Atomic Visual Actions
CVPR 2018
PoTion: Pose MoTion Representation for Action Recognition
CVPR 2018
Actor and Observer: Joint Modeling of First and Third-Person Videos
CVPR 2018
How good is my GAN?
ECCV 2018
Actor-centric Relation Network
ECCV 2018
Joint Learning of Object and Action Detectors
ICCV 2017
Action Tubelet Detector for Spatio-Temporal Action Localization
ICCV 2017
Learning Video Object Segmentation With Visual Memory
ICCV 2017
Learning From Synthetic Humans
CVPR 2017
Weakly-Supervised Learning of Visual Relations
ICCV 2017
Learning Motion Patterns in Videos
CVPR 2017
LCR-Net: Localization-Classification-Regression for Human Pose
CVPR 2017
Areas of Attention for Image Captioning
ICCV 2017
SCNet: Learning Semantic Correspondence
ICCV 2017
Incremental Learning of Object Detectors Without Catastrophic Forgetting
ICCV 2017
BlitzNet: A Real-Time Deep Network for Scene Understanding
ICCV 2017
MoCap-guided Data Augmentation for 3D Pose Estimation in the Wild
NIPS 2016
Proposal Flow
CVPR 2016
Weakly-Supervised Alignment of Video With Text
ICCV 2015
Learning to Detect Motion Boundaries
CVPR 2015
Online Object Tracking With Proposal Selection
ICCV 2015
Learning to Track for Spatio-Temporal Action Localization
ICCV 2015
Unsupervised Object Discovery and Tracking in Video Collections
ICCV 2015
Local Convolutional Features With Unsupervised Training for Image Retrieval
ICCV 2015
P-CNN: Pose-Based CNN Features for Action Recognition
ICCV 2015
Unsupervised Object Discovery and Localization in the Wild: Part-Based Matching With Bottom-Up Region Proposals
CVPR 2015
EpicFlow: Edge-Preserving Interpolation of Correspondences for Optical Flow
CVPR 2015
Convolutional Kernel Networks
NIPS 2014
Mixing Body-Part Sequences for Human Pose Estimation
CVPR 2014
Multi-fold MIL Training for Weakly Supervised Object Localization
CVPR 2014
Efficient Action Localization with Approximately Normalized Fisher Vectors
CVPR 2014
Transformation Pursuit for Image Classification
CVPR 2014
Label-Embedding for Attribute-Based Classification
CVPR 2013
Expanded Parts Model for Human Attribute and Action Recognition in Still Images
CVPR 2013
Event Retrieval in Large Video Collections with Circulant Temporal Encoding
CVPR 2013
Stable Hyper-pooling and Query Expansion for Event Detection
ICCV 2013
DeepFlow: Large Displacement Optical Flow with Deep Matching
ICCV 2013
Towards Understanding Action Recognition
ICCV 2013
Segmentation Driven Object Detection with Fisher Vectors
ICCV 2013
Estimating Human Pose with Flowing Puppets
ICCV 2013
Action and Event Recognition with Fisher Vectors on a Compact Feature Set
ICCV 2013
Action Recognition with Improved Trajectories
ICCV 2013