Josef Sivic

62 papers · 2009–2025 · 9 top CS/AI conferences

Achievements

🐝 Cross-Pollinator (12) 🌍 Conference Polyglot (9) 🏃 Academic Marathon (16) 🧭 Keyword Pioneer 🌈 Renaissance Researcher (12) 🌉 Interdisciplinary Bridge 🌟 Keyword Trendsetter Combo (6) 🏠 Conference Loyalist (30) 🤝 Dynamic Duo (25) 👥 Mega-Team (31) 🔬 Deep Specialist (10) 🏆 Keyword Champion ⚡ Prolific Year (5) 📈 Trend Setter 🚀 Conference Pioneer ❓ The Questioner (3) 🗃️ Keyword Collector (230) 🔥 Unstoppable (13) 💎 Century Club (62)

Conferences

CVPR (30) ICCV (13) NIPS (7) ECCV (4) ICLR (3) CORL (2) AAAI (1) EMNLP (1) RSS (1)

Top co-authors

Ivan Laptev (25) Bryan Russell (11) Tomas Pajdla (10) Akihiko Torii (9) Antoine Miech (9) Jean-Baptiste Alayrac (9) Cordelia Schmid (8) Ignacio Rocco (6) Masatoshi Okutomi (6) Relja Arandjelović (6)

Keywords

video understanding (9) convolutional neural network (8) multimodal learning (6) visual localization (6) pose estimation (5) object detection (5) weakly supervised learning (5) 3d reconstruction (4) image matching (4) image retrieval (4) render and compare (3) visual place recognition (3) zero-shot learning (3) place recognition (3) cross-modal retrieval (3) self-supervised learning (3) video retrieval (3) vision-language model (3) video segmentation (2) semantic segmentation (2)

Papers

ShowHowTo: Generating Scene-Conditioned Step-by-Step Visual Instructions CVPR 2025
Improving Personalized Search with Regularized Low-Rank Parameter Updates CVPR 2025
Learning to engineer protein flexibility ICLR 2025
6D Object Pose Tracking in Internet Videos for Robotic Manipulation ICLR 2025
ResidualViT for Efficient Temporally Dense Video Encoding ICCV 2025
Discovering Divergent Representations between Text-to-Image Models ICCV 2025
Large-scale Pre-training for Grounded Video Caption Generation ICCV 2025
Learning to design protein-protein interactions with enhanced generalization ICLR 2024
GenHowTo: Learning to Generate Actions and State Transformations from Instructional Videos CVPR 2024
MassSpecGym: A benchmark for the discovery and identification of molecules NIPS 2024
Vid2Seq: Large-Scale Pretraining of a Visual Language Model for Dense Video Captioning CVPR 2023
Language-Guided Music Recommendation for Video via Prompt Analogies CVPR 2023
Meta-Personalizing Vision-Language Models To Find Named Instances in Video CVPR 2023
POP-3D: Open-Vocabulary 3D Occupancy Prediction from Images NIPS 2023
VidChapters-7M: Video Chapters at Scale NIPS 2023
TubeDETR: Spatio-Temporal Video Grounding With Transformers CVPR 2022
Look for the Change: Learning Object States and State-Modifying Actions From Untrimmed Web Videos CVPR 2022
Zero-Shot Video Question Answering via Frozen Bidirectional Language Models NIPS 2022
MegaPose: 6D Pose Estimation of Novel Objects via Render & Compare CORL 2022
Collision Detection Accelerated: An Optimization Perspective RSS 2022
Drive&Segment: Unsupervised Semantic Segmentation of Urban Scenes via Cross-Modal Distillation ECCV 2022
Focal Length and Object Pose Estimation via Render and Compare CVPR 2022
Weakly Supervised Human-Object Interaction Detection in Video via Contrastive Spatiotemporal Regions ICCV 2021
Artificial Dummies for Urban Dataset Augmentation AAAI 2021
Single-View Robot Pose and Joint Angle Estimation via Render & Compare CVPR 2021
Thinking Fast and Slow: Efficient Text-to-Visual Retrieval With Transformers CVPR 2021
Just Ask: Learning To Answer Questions From Millions of Narrated Videos ICCV 2021
Efficient Neighbourhood Consensus Networks via Submanifold Sparse Convolutions ECCV 2020
End-to-End Learning of Visual Representations From Uncurated Instructional Videos CVPR 2020
Learning Object Manipulation Skills via Approximate State Estimation from Real Videos CORL 2020
Learning Actionness via Long-range Temporal Order Verification ECCV 2020
CosyPose: Consistent multi-view multi-object 6D pose estimation ECCV 2020
Estimating 3D Motion and Forces of Person-Object Interactions From Monocular Video CVPR 2019
Detecting Unseen Visual Relations Using Analogies ICCV 2019
HowTo100M: Learning a Text-Video Embedding by Watching Hundred Million Narrated Video Clips ICCV 2019
Is This the Right Place? Geometric-Semantic Pose Verification for Indoor Visual Localization ICCV 2019
Cross-Task Weakly Supervised Learning From Instructional Videos CVPR 2019
D2-Net: A Trainable CNN for Joint Description and Detection of Local Features CVPR 2019
Benchmarking 6DOF Outdoor Visual Localization in Changing Conditions CVPR 2018
InLoc: Indoor Visual Localization With Dense Matching and View Synthesis CVPR 2018
Neighbourhood Consensus Networks NIPS 2018
End-to-End Weakly-Supervised Semantic Alignment CVPR 2018
Localizing Moments in Video with Temporal Language EMNLP 2018
Learning From Video and Text via Large-Scale Discriminative Clustering ICCV 2017
Localizing Moments in Video With Natural Language ICCV 2017
ActionVLAD: Learning Spatio-Temporal Aggregation for Action Classification CVPR 2017
Are Large-Scale 3D Models Really Necessary for Accurate Visual Localization? CVPR 2017
Convolutional Neural Network Architecture for Geometric Matching CVPR 2017
Joint Discovery of Object States and Manipulation Actions ICCV 2017
Weakly-Supervised Learning of Visual Relations ICCV 2017
NetVLAD: CNN Architecture for Weakly Supervised Place Recognition CVPR 2016
Unsupervised Learning From Narrated Instruction Videos CVPR 2016
On Pairwise Costs for Network Flow Multi-Object Tracking CVPR 2015
24/7 Place Recognition by View Synthesis CVPR 2015
Is Object Localization for Free? - Weakly-Supervised Learning With Convolutional Neural Networks CVPR 2015
Learning and Transferring Mid-Level Image Representations using Convolutional Neural Networks CVPR 2014
Seeing 3D Chairs: Exemplar Part-based 2D-3D Alignment using a Large Dataset of CAD Models CVPR 2014
Learning and Calibrating Per-Location Classifiers for Visual Place Recognition CVPR 2013
Pose Estimation and Segmentation of People in 3D Movies ICCV 2013
Visual Place Recognition with Repetitive Structures CVPR 2013
Learning person-object interactions for action recognition in still images NIPS 2011
Segmenting Scenes by Matching Image Composites NIPS 2009