Siyuan Huang

82 papers · 2017–2025 · 14 top CS/AI conferences

Achievements

🏃 Academic Marathon (8) 🌍 Conference Polyglot (14) 🧭 Keyword Pioneer 🌉 Interdisciplinary Bridge 🐝 Cross-Pollinator (8) 🌈 Renaissance Researcher (11) 🗺️ Taxonomy Completionist (115) 🏠 Conference Loyalist (21) 🤝 Dynamic Duo (32) 👑 Triple Crown 👥 Mega-Team (34) 🏆 Keyword Champion (5) 🏆 Grand Slam 🔬 Deep Specialist (19) ⚡ Prolific Year (5) 🚀 Conference Pioneer 🔥 Unstoppable (9) 🗃️ Keyword Collector (316) 📈 Trend Setter 💎 Century Club (82)

Conferences

CVPR (21) ICCV (15) NIPS (9) ECCV (8) ICLR (8) AAAI (4) ACL (4) CORL (4) ICML (3) EMNLP (2) IJCAI (1) IJCNLP (1) RSS (1) WACV (1)

Top co-authors

Song-Chun Zhu (32) Yixin Chen (23) Baoxiong Jia (22) Yixin Zhu (22) Qing Li (16) Tengyu Liu (14) Hongsheng Li (8) Puhao Li (8) Zhouhan Lin (6) Wei Liang (6)

Research topics

Probability (1)

Keywords

diffusion model (6) 3d scene understanding (5) visual grounding (4) video understanding (4) imitation learning (4) robotic manipulation (4) symbolic reasoning (4) scene understanding (4) contrastive learning (4) 3d reconstruction (4) zero-shot learning (3) scene reconstruction (3) multimodal learning (3) sim-to-real transfer (3) embodied ai (3) 3d vision (3) human-object interaction (3) motion synthesis (3) object detection (3) question answering (3)

Papers

GUIOdyssey: A Comprehensive Dataset for Cross-App GUI Navigation on Mobile Devices ICCV 2025
Learning a Unified Policy for Position and Force Control in Legged Loco-Manipulation CORL 2025
ControlVLA: Few-shot Object-centric Adaptation for Pre-trained Vision-Language-Action Models CORL 2025
CLONE: Closed-Loop Whole-Body Humanoid Teleoperation for Long-Horizon Tasks CORL 2025
Gumbel Reranking: Differentiable End-to-End Reranker Optimization ACL 2025
AMEX: Android Multi-annotation Expo Dataset for Mobile GUI Agents ACL 2025
ManipTrans: Efficient Dexterous Bimanual Manipulation Transfer via Residual Learning CVPR 2025
Decompositional Neural Scene Reconstruction with Generative Diffusion Prior CVPR 2025
METASCENES: Towards Automated Replica Creation for Real-world 3D Scans CVPR 2025
InteractAnything: Zero-shot Human Object Interaction Synthesis via LLM Feedback and Object Affordance Parsing CVPR 2025
Masked Point-Entity Contrast for Open-Vocabulary 3D Scene Understanding CVPR 2025
Unveiling the Mist over 3D Vision-Language Understanding: Object-centric Evaluation with Chain-of-Analysis CVPR 2025
Object-Centric Prompt-Driven Vision-Language-Action Model for Robotic Manipulation CVPR 2025
Dynamic Motion Blending for Versatile Motion Editing CVPR 2025
GROVE: A Generalized Reward for Learning Open-Vocabulary Physical Skill CVPR 2025
MOVIS: Enhancing Multi-Object Novel View Synthesis for Indoor Scenes CVPR 2025
Training LLMs to be Better Text Embedders through Bidirectional Reconstruction EMNLP 2025
PrimHOI: Compositional Human-Object Interaction via Reusable Primitives ICCV 2025
Trace3D: Consistent Segmentation Lifting via Gaussian Instance Tracing ICCV 2025
GWM: Towards Scalable Gaussian World Models for Robotic Manipulation ICCV 2025
Move to Understand a 3D Scene: Bridging Visual Grounding and Exploration for Efficient and Versatile Embodied Navigation ICCV 2025
TACO: Taming Diffusion for in-the-wild Video Amodal Completion ICCV 2025
PixWizard: Versatile Image-to-Image Visual Assistant with Open-Language Instructions ICLR 2025
Draw-and-Understand: Leveraging Visual Prompts to Enable MLLMs to Comprehend What You Want ICLR 2025
Building Interactable Replicas of Complex Articulated Objects via Gaussian Splatting ICLR 2025
RoboVerse: A Unified Platform, Benchmark and Dataset for Scalable and Generalizable Robot Learning RSS 2025
VILLS: Video-Image Learning to Learn Semantics for Person Re-Identification WACV 2025
F-HOI: Toward Fine-grained Semantic-Aligned 3D Human-Object Interactions ECCV 2024
Unifying 3D Vision-Language Understanding via Promptable Queries ECCV 2024
SPHINX: A Mixer of Weights, Visual Embeddings and Image Scales for Multi-modal Large Language Models ECCV 2024
Mirror-Consistency: Harnessing Inconsistency in Majority Voting EMNLP 2024
Move as You Say Interact as You Can: Language-guided Human Motion Generation with Scene Affordance CVPR 2024
Not All Experts are Equal: Efficient Expert Pruning and Skipping for Mixture-of-Experts Large Language Models ACL 2024
SPHINX-X: Scaling Data and Parameters for a Family of Multi-modal Large Language Models ICML 2024
3D Vision and Language Pretraining with Large-Scale Synthetic Data IJCAI 2024
A3VLM: Actionable Articulation-Aware Vision Language Model CORL 2024
Multi-modal Situated Reasoning in 3D Scenes NIPS 2024
An Embodied Generalist Agent in 3D World ICML 2024
Scaling Up Dynamic Human-Scene Interaction Modeling CVPR 2024
AnySkill: Learning Open-Vocabulary Physical Skill for Interactive Agents CVPR 2024
PhyScene: Physically Interactable 3D Scene Synthesis for Embodied AI CVPR 2024
Graph Parsing Networks ICLR 2024
Cluster-wise Graph Transformer with Dual-granularity Kernelized Attention NIPS 2024
PhyRecon: Physically Plausible Neural Scene Reconstruction NIPS 2024
Neural-Symbolic Recursive Machine for Systematic Generalization ICLR 2024
SlotLifter: Slot-guided Feature Lifting for Learning Object-Centric Radiance Fields ECCV 2024
SceneVerse: Scaling 3D Vision-Language Learning for Grounded Scene Understanding ECCV 2024
GAPartNet: Cross-Category Domain-Generalizable Object Perception and Manipulation via Generalizable and Actionable Parts CVPR 2023
Improving Object-centric Learning with Query Optimization ICLR 2023
SQA3D: Situated Question Answering in 3D Scenes ICLR 2023
A Minimalist Dataset for Systematic Generalization of Perception, Syntax, and Semantics ICLR 2023
ProBio: A Protocol-guided Multimodal Dataset for Molecular Biology Lab NIPS 2023
Tailoring Self-Attention for Graph via Rooted Subtrees NIPS 2023
3D-VisTA: Pre-trained Transformer for 3D Vision and Text Alignment ICCV 2023
ARNOLD: A Benchmark for Language-Grounded Task Learning with Continuous States in Realistic 3D Scenes ICCV 2023
Full-Body Articulated Human-Object Interaction ICCV 2023
Prompt, Generate, Then Cache: Cascade of Foundation Models Makes Strong Few-Shot Learners CVPR 2023
Diffusion-Based Generation, Optimization, and Planning in 3D Scenes CVPR 2023
HUMANISE: Language-conditioned Human Motion Generation in 3D Scenes NIPS 2022
Adversarial Texture for Fooling Person Detectors in the Physical World CVPR 2022
Learning V1 Simple Cells with Vector Representation of Local Content and Matrix Representation of Local Motion AAAI 2022
Infrared Invisible Clothing: Hiding From Infrared Detectors at Multiple Angles in Real World CVPR 2022
EgoTaskQA: Understanding Human Tasks in Egocentric Videos NIPS 2022
YouRefIt: Embodied Reference Understanding With Language and Gesture ICCV 2021
Inter-GPS: Interpretable Geometry Problem Solving with Formal Language and Symbolic Reasoning ACL 2021
Spatio-Temporal Self-Supervised Representation Learning for 3D Point Clouds ICCV 2021
VLGrammar: Grounded Grammar Induction of Vision and Language ICCV 2021
Learning Neural Representation of Camera Pose with Matrix Representation of Pose Shift via View Synthesis CVPR 2021
Inter-GPS: Interpretable Geometry Problem Solving with Formal Language and Symbolic Reasoning IJCNLP 2021
Learning by Fixing: Solving Math Word Problems with Weak Supervision AAAI 2021
SMART: A Situation Model for Algebra Story Problems via Attributed Grammar AAAI 2021
Closed Loop Neural-Symbolic Learning via Integrating Neural Perception, Grammar Parsing, and Symbolic Reasoning ICML 2020
Streaming Batch Gradient Tracking for Neural Network Training (Student Abstract) AAAI 2020
LEMMA: A Multi-view Dataset for LEarning Multi-agent Multi-task Activities ECCV 2020
A Competence-aware Curriculum for Visual Concepts Learning via Question Answering ECCV 2020
Holistic++ Scene Understanding: Single-View 3D Holistic Scene Parsing and Human Pose Estimation With Human-Object Interaction and Physical Commonsense ICCV 2019
Understanding Human Gaze Communication by Spatio-Temporal Graph Reasoning ICCV 2019
PerspectiveNet: 3D Object Detection from a Single RGB Image via Perspective Points NIPS 2019
Holistic 3D Scene Parsing and Reconstruction from a Single RGB Image ECCV 2018
Human-Centric Indoor Scene Synthesis Using Stochastic Grammar CVPR 2018
Cooperative Holistic Scene Understanding: Unifying 3D Object, Layout, and Camera Pose Estimation NIPS 2018
Predicting Human Activities Using Stochastic Grammar ICCV 2017