Ivan Laptev

64 papers · 2011–2025 · 9 conferences · across top CS/AI conferences

Achievements

🧭 Keyword Pioneer 🌍 Conference Polyglot (9) 🌉 Interdisciplinary Bridge 🌈 Renaissance Researcher (6) 🏃 Academic Marathon (14) 🌟 Keyword Trendsetter Combo (4) 🏠 Conference Loyalist (28) 🤝 Dynamic Duo (30) 🏆 Keyword Champion 👥 Mega-Team (69) 🔬 Deep Specialist (12) 🚀 Conference Pioneer 💎 Century Club (64) 📈 Trend Setter ⚡ Prolific Year (7) 🗃️ Keyword Collector (290) 🔥 Unstoppable (13) ❓ The Questioner

Conferences

CVPR (28) ICCV (14) NIPS (9) CORL (4) ECCV (4) ACL (2) EMNLP (1) ICML (1) L4DC (1)

Top co-authors

Cordelia Schmid (30) Josef Sivic (25) Jean-Baptiste Alayrac (10) Shizhe Chen (9) Antoine Miech (9) Makarand Tapaswi (8) Pierre-Louis Guhur (6) Antoine Yang (5) Piotr Bojanowski (5) Ricardo Garcia Pinel (3)

Keywords

video understanding (9) multimodal learning (8) action recognition (8) weakly supervised learning (6) object detection (5) self-supervised learning (5) convolutional neural network (5) 3d reconstruction (5) robotic manipulation (4) zero-shot learning (4) depth estimation (3) instructional video (3) large multimodal model (3) video segmentation (3) differentiable rendering (3) semantic segmentation (3) action localization (3) hand pose estimation (3) video question answering (3) multimodal transformer (3)

Papers

All Languages Matter: Evaluating LMMs on Culturally Diverse 100 Languages CVPR 2025
LlamaV-o1: Rethinking Step-by-step Visual Reasoning in LLMs ACL 2025
RoomTour3D: Geometry-Aware Video-Instruction Tuning for Embodied Navigation CVPR 2025
A Culturally-diverse Multilingual Multimodal Video Benchmark & Model EMNLP 2025
ShowHowTo: Generating Scene-Conditioned Step-by-Step Visual Instructions CVPR 2025
Learning Feasible Transitions for Efficient Contact Planning L4DC 2025
ScanEdit: Hierarchically-Guided Functional 3D Scan Editing ICCV 2025
SUGAR: Pre-training 3D Visual Representations for Robotics CVPR 2024
PairDETR : Joint Detection and Association of Human Bodies and Faces CVPR 2024
Mitigating Object Hallucination via Concentric Causal Attention NIPS 2024
GenHowTo: Learning to Generate Actions and State Transformations from Instructional Videos CVPR 2024
Vid2Seq: Large-Scale Pretraining of a Visual Language Model for Dense Video Captioning CVPR 2023
PolarNet: 3D Point Clouds for Language-Guided Robotic Manipulation CORL 2023
Tackling Ambiguity with Images: Improved Multimodal Machine Translation and Contrastive Evaluation ACL 2023
gSDF: Geometry-Driven Signed Distance Functions for 3D Hand-Object Reconstruction CVPR 2023
VidChapters-7M: Video Chapters at Scale NIPS 2023
Look for the Change: Learning Object States and State-Modifying Actions From Untrimmed Web Videos CVPR 2022
Zero-Shot Video Question Answering via Frozen Bidirectional Language Models NIPS 2022
Language Conditioned Spatial Relation Reasoning for 3D Object Grounding NIPS 2022
Instruction-driven history-aware policies for robotic manipulations CORL 2022
Think Global, Act Local: Dual-Scale Graph Transformer for Vision-and-Language Navigation CVPR 2022
TubeDETR: Spatio-Temporal Video Grounding With Transformers CVPR 2022
AlignSDF: Pose-Aligned Signed Distance Fields for Hand-Object Reconstruction ECCV 2022
Learning from Unlabeled 3D Environments for Vision-and-Language Navigation ECCV 2022
Differentiable rendering with perturbed optimizers NIPS 2021
Thinking Fast and Slow: Efficient Text-to-Visual Retrieval With Transformers CVPR 2021
Segmenter: Transformer for Semantic Segmentation ICCV 2021
Airbert: In-Domain Pretraining for Vision-and-Language Navigation ICCV 2021
History Aware Multimodal Transformer for Vision-and-Language Navigation NIPS 2021
Just Ask: Learning To Answer Questions From Millions of Narrated Videos ICCV 2021
XCiT: Cross-Covariance Image Transformers NIPS 2021
Goal-Conditioned Reinforcement Learning with Imagined Subgoals ICML 2021
Learning Obstacle Representations for Neural Motion Planning CORL 2020
Learning Object Manipulation Skills via Approximate State Estimation from Real Videos CORL 2020
Leveraging Photometric Consistency Over Time for Sparsely Supervised Hand-Object Reconstruction CVPR 2020
Action Modifiers: Learning From Adverbs in Instructional Videos CVPR 2020
End-to-End Learning of Visual Representations From Uncurated Instructional Videos CVPR 2020
Learning Interactions and Relationships Between Movie Characters CVPR 2020
Learning Actionness via Long-range Temporal Order Verification ECCV 2020
Deep Metric Learning Beyond Binary Supervision CVPR 2019
Estimating 3D Motion and Forces of Person-Object Interactions From Monocular Video CVPR 2019
Learning Joint Reconstruction of Hands and Manipulated Objects CVPR 2019
Detecting Unseen Visual Relations Using Analogies ICCV 2019
HowTo100M: Learning a Text-Video Embedding by Watching Hundred Million Narrated Video Clips ICCV 2019
Cross-Task Weakly Supervised Learning From Instructional Videos CVPR 2019
BodyNet: Volumetric Inference of 3D Human Body Shapes ECCV 2018
A flexible model for training action localization with varying levels of supervision NIPS 2018
Learning From Synthetic Humans CVPR 2017
Weakly-Supervised Learning of Visual Relations ICCV 2017
Learning From Video and Text via Large-Scale Discriminative Clustering ICCV 2017
Joint Discovery of Object States and Manipulation Actions ICCV 2017
Instance-Level Video Segmentation From Object Tracks CVPR 2016
Thin-Slicing for Pose: Learning to Understand Pose Without Explicit Pose Estimation CVPR 2016
Unsupervised Learning From Narrated Instruction Videos CVPR 2016
Is Object Localization for Free? - Weakly-Supervised Learning With Convolutional Neural Networks CVPR 2015
On Pairwise Costs for Network Flow Multi-Object Tracking CVPR 2015
Context-Aware CNNs for Person Head Detection ICCV 2015
Unsupervised Object Discovery and Tracking in Video Collections ICCV 2015
P-CNN: Pose-Based CNN Features for Action Recognition ICCV 2015
Weakly-Supervised Alignment of Video With Text ICCV 2015
Learning and Transferring Mid-Level Image Representations using Convolutional Neural Networks CVPR 2014
Efficient Feature Extraction, Encoding and Classification for Action Recognition CVPR 2014
Pose Estimation and Segmentation of People in 3D Movies ICCV 2013
Learning person-object interactions for action recognition in still images NIPS 2011