Juan Carlos Niebles

72 papers · 2014–2025 · 14 top CS/AI conferences

Achievements

🌍 Conference Polyglot (14) 🏃 Academic Marathon (11) 🧭 Keyword Pioneer 🌉 Interdisciplinary Bridge 🐝 Cross-Pollinator (7) 🌈 Renaissance Researcher (8) 🗺️ Taxonomy Completionist (105) 🏠 Conference Loyalist (27) 🧬 Topic Evolution 🤝 Dynamic Duo (20) 🏆 Keyword Champion (4) 🏆 Grand Slam 🔬 Deep Specialist (11) ❓ The Questioner ⚡ Prolific Year (7) 🚀 Conference Pioneer 🔥 Unstoppable (12) 📈 Trend Setter 💎 Century Club (72) 🗃️ Keyword Collector (302)

Conferences

CVPR (27) ECCV (10) ICCV (9) NIPS (8) EMNLP (5) ICML (3) WACV (3) AAAI (1) ACL (1) CLeaR (1) CoNLL (1) ICLR (1) NAACL (1) PGM (1)

Top co-authors

Silvio Savarese (20) Caiming Xiong (18) Li Fei-Fei (17) Ehsan Adeli (14) De-An Huang (11) Huan Wang (11) Ran Xu (10) Shelby Heinecke (10) Jiajun Wu (9) Weiran Yao (9)

Research topics

Computer Vision (1)

Keywords

video understanding (14) action recognition (7) vision-language model (5) activity recognition (5) few-shot learning (5) multimodal learning (5) temporal alignment (4) reinforcement learning (3) instructional video (3) weakly supervised learning (3) graph neural network (3) language model (3) human pose estimation (3) video analysis (3) feature learning (2) trajectory prediction (2) contrastive learning (2) multi-modal learning (2) knowledge distillation (2) self-supervised learning (2)

Papers

ViUniT: Visual Unit Tests for More Robust Visual Programming · CVPR 2025
xLAM: A Family of Large Action Models to Empower AI Agent Systems · NAACL 2025
Understanding Complexity in VideoQA via Visual Program Generation · ICML 2025
Unifying Specialized Visual Encoders for Video Language Models · ICML 2025
Why Is Spatial Reasoning Hard for VLMs? An Attention Mechanism Perspective on Focus Areas · ICML 2025
LAM SIMULATOR: Advancing Data Generation for Large Action Model Training via Online Exploration and Trajectory Feedback · ACL 2025
UniEgoMotion: A Unified Model for Egocentric Motion Reconstruction, Forecasting, and Generation · ICCV 2025
Re-thinking Temporal Search for Long-Form Video Understanding · CVPR 2025
Contra4: Evaluating Contrastive Cross-Modal Reasoning in Audio, Video, Image, and 3D · EMNLP 2025
ActionStudio: A Lightweight Framework for Data and Training of Large Action Models · EMNLP 2025
LATTE: Learning to Think with Vision Specialists · EMNLP 2025
PRACT: Optimizing Principled Reasoning and Acting of LLM Agent · CoNLL 2024
X-InstructBLIP: A Framework for Aligning Image, 3D, Audio, Video to LLMs and its Emergent Cross-modal Reasoning · ECCV 2024
LayoutDETR: Detection Transformer Is a Good Multimodal Layout Designer · ECCV 2024
ULIP-2: Towards Scalable Multimodal Pre-training for 3D Understanding · CVPR 2024
IKEA Manuals at Work: 4D Grounding of Assembly Instructions on Internet Videos · NIPS 2024
Streaming Detection of Queried Event Start · NIPS 2024
On the Unlikelihood of D-Separation · PGM 2024
Causal Layering via Conditional Entropy · CLeaR 2024
PRACT: Optimizing Principled Reasoning and Acting of LLM Agent · EMNLP 2024
Retroformer: Retrospective Large Language Agents with Policy Gradient Optimization · ICLR 2024
APIGen: Automated PIpeline for Generating Verifiable and Diverse Function-Calling Datasets · NIPS 2024
Mask-Free OVIS: Open-Vocabulary Instance Segmentation Without Manual Mask Annotations · CVPR 2023
ULIP: Learning a Unified Representation of Language, Images, and Point Clouds for 3D Understanding · CVPR 2023
PreViTS: Contrastive Pretraining With Video Tracking Supervision · WACV 2023
Temporally Disentangled Representation Learning under Unknown Nonstationarity · NIPS 2023
UniControl: A Unified Diffusion Model for Controllable Visual Generation In the Wild · NIPS 2023
Deformer: Dynamic Fusion Transformer for Robust Hand Pose Estimation · ICCV 2023
Procedure-Aware Pretraining for Instructional Video Understanding · CVPR 2023
PrivHAR: Recognizing Human Actions from Privacy-Preserving Lens · ECCV 2022
MOMA-LRG: Language-Refined Graphs for Multi-Object Multi-Actor Activity Parsing · NIPS 2022
Align and Prompt: Video-and-Language Pre-Training With Entity Prompts · CVPR 2022
Revisiting the "Video" in Video-Language Understanding · CVPR 2022
Open Vocabulary Object Detection with Pseudo Bounding-Box Labels · ECCV 2022
Metadata Normalization · CVPR 2021
MOMA: Multi-Object Multi-Actor Activity Parsing · NIPS 2021
TRiPOD: Human Trajectory and Pose Dynamics Forecasting in the Wild · ICCV 2021
Detecting Human-Object Relationships in Videos · ICCV 2021
Learning Privacy-Preserving Optics for Human Pose Estimation · ICCV 2021
Representation Learning With Statistical Independence to Mitigate Bias · WACV 2021
Home Action Genome: Cooperative Compositional Action Understanding · CVPR 2021
Spatio-Temporal Graph for Video Captioning With Knowledge Distillation · CVPR 2020
Few-Shot Video Classification via Temporal Alignment · CVPR 2020
Disentangling Human Dynamics for Pedestrian Locomotion Forecasting with Noisy Supervision · WACV 2020
Adversarial Cross-Domain Action Recognition with Co-Attention · AAAI 2020
RubiksNet: Learnable 3D-Shift for Efficient Video Action Recognition · ECCV 2020
Procedure Planning in Instructional Videos · ECCV 2020
Action Genome: Actions As Compositions of Spatio-Temporal Scene Graphs · CVPR 2020
D3TW: Discriminative Differentiable Dynamic Time Warping for Weakly Supervised Action Alignment and Segmentation · CVPR 2019
Learning Temporal Action Proposals With Fewer Labels · ICCV 2019
Imitation Learning for Human Pose Prediction · ICCV 2019
Neural Task Graphs: Generalizing to Unseen Tasks From a Single Video Demonstration · CVPR 2019
Peeking Into the Future: Predicting Future Person Activities and Locations in Videos · CVPR 2019
Liquid Pouring Monitoring via Rich Sensory Inputs · ECCV 2018
Temporal Modular Networks for Retrieving Complex Compositional Activities in Videos · ECCV 2018
Learning to Decompose and Disentangle Representations for Video Prediction · NIPS 2018
Translating Navigation Instructions in Natural Language to a High-Level Plan for Behavioral Robot Navigation · EMNLP 2018
What Makes a Video a Video: Analyzing Temporal Information in Video Understanding Models and Datasets · CVPR 2018
Finding "It": Weakly-Supervised Reference-Aware Visual Grounding in Instructional Videos · CVPR 2018
Graph Distillation for Action Detection with Privileged Modalities · ECCV 2018
End-to-End Joint Semantic Segmentation of Actors and Actions in Video · ECCV 2018
Visual Forecasting by Imitating Dynamics in Natural Sequences · ICCV 2017
Dense-Captioning Events in Videos · ICCV 2017
Agent-Centric Risk Assessment: Accident Anticipation and Risky Region Localization · CVPR 2017
SST: Single-Stream Temporal Action Proposals · CVPR 2017
Unsupervised Visual-Linguistic Reference Resolution in Instructional Videos · CVPR 2017
A Hierarchical Pose-Based Approach to Complex Action Understanding Using Dictionaries of Actionlets and Motion Poselets · CVPR 2016
Fast Temporal Activity Proposals for Efficient Detection of Human Actions in Untrimmed Videos · CVPR 2016
ActivityNet: A Large-Scale Video Benchmark for Human Activity Understanding · CVPR 2015
Robust Manhattan Frame Estimation From a Single RGB-D Image · CVPR 2015
On the Relationship Between Visual Attributes and Convolutional Networks · CVPR 2015
Discriminative Hierarchical Modeling of Spatio-Temporally Composable Human Activities · CVPR 2014