Song-Chun Zhu

162 papers · 2011–2026 · 14 top CS/AI conferences

Achievements


🗺️ Taxonomy Completionist (11) 🧭 Keyword Pioneer 🌈 Renaissance Researcher (6) 🌉 Interdisciplinary Bridge 🐣 Hot Topic Early Bird

🏃 Academic Marathon (14) 🏠 Conference Loyalist (27) 🌟 Keyword Trendsetter Combo (8) 🤝 Dynamic Duo (42) 👑 Triple Crown 🌱 Topic Pioneer 🏆 Grand Slam 🔬 Deep Specialist (29) 🧬 Topic Evolution 🏆 Keyword Champion (4) 🔥 Unstoppable (13) ⚡ Prolific Year (14) ❓ The Questioner 💎 Century Club (158) 🗃️ Keyword Collector (638) 📈 Trend Setter 🚀 Conference Pioneer

Conferences

CVPR (42) NIPS (27) ICCV (22) AAAI (19) ICLR (12) ACL (9) ICML (9) ECCV (6) EMNLP (5) IJCAI (4) IJCNLP (3) CORL (2) NAACL (1) RSS (1)

Top co-authors

Ying Nian Wu (42) Siyuan Huang (32) Yixin Zhu (32) Qing Li (18) Baoxiong Jia (17) Zilong Zheng (15) Siyuan Qi (14) Jianwen Xie (13) Ruiqi Gao (12) Xiaojian Ma (10)

Research topics

Probability (1)

Keywords

energy-based model (17) generative model (17) markov chain monte carlo (14) object detection (9) scene understanding (9) video understanding (8) and-or graph (7) reinforcement learning (6) unsupervised learning (6) visual reasoning (6) large language model (6) representation learning (6) convolutional neural network (6) 3d reconstruction (5) multi-agent system (5) symbolic reasoning (5) visual question answering (5) neural network (5) human pose estimation (4) hierarchical model (4)

Papers

v-HUB: A Benchmark for Video Humor Understanding from Vision and Sound ACL 2026
TongUI: Internet-Scale Trajectories from Multimodal Web Tutorials for Generalized GUI Agents AAAI 2026
JurisBench: A Deep Benchmark for Assessing Large Language Models in Professional Legal Practice ACL 2026
Video Echoed in Music: Semantic, Temporal, and Rhythmic Alignment for Video-to-Music Generation AAAI 2026
ControlVLA: Few-shot Object-centric Adaptation for Pre-trained Vision-Language-Action Models CORL 2025
Differentiable Information Enhanced Model-Based Reinforcement Learning AAAI 2025
Enhancing LLM-Based Social Bot via an Adversarial Learning Framework EMNLP 2025
Multi-modal Agent Tuning: Building a VLM-Driven Agent for Efficient Tool Usage ICLR 2025
Building Interactable Replicas of Complex Articulated Objects via Gaussian Splatting ICLR 2025
ReflectEvo: Improving Meta Introspection of Small LLMs by Learning Self-Reflection ACL 2025
Unveiling the Mist over 3D Vision-Language Understanding: Object-centric Evaluation with Chain-of-Analysis CVPR 2025
Decompositional Neural Scene Reconstruction with Generative Diffusion Prior CVPR 2025
METASCENES: Towards Automated Replica Creation for Real-world 3D Scans CVPR 2025
Fast Peer Adaptation with Context-aware Exploration ICML 2024
INTERPRET: Interactive Predicate Learning from Language Feedback for Generalizable Task Planning RSS 2024
ProAgent: Building Proactive Cooperative Agents with Large Language Models AAAI 2024
FIRE: A Dataset for Feedback Integration and Refinement Evaluation of Multimodal Models NIPS 2024
Learning to Balance Altruism and Self-interest Based on Empathy in Mixed-Motive Games NIPS 2024
RulE: Knowledge Graph Reasoning with Rule Embedding ACL 2024
CLOVA: A Closed-LOop Visual Assistant with Tool Usage and Update CVPR 2024
LangSuit·E: Planning, Controlling and Interacting with Large Language Models in Embodied Text Environments ACL 2024
Neural-Symbolic Recursive Machine for Systematic Generalization ICLR 2024
Bongard-OpenWorld: Few-Shot Reasoning for Free-form Visual Concepts in the Real World ICLR 2024
CivRealm: A Learning and Reasoning Odyssey in Civilization for Decision-Making Agents ICLR 2024
Mars: Situated Inductive Reasoning in an Open-World Environment NIPS 2024
PhyRecon: Physically Plausible Neural Scene Reconstruction NIPS 2024
AdaSociety: An Adaptive Environment with Social Structures for Multi-Agent Decision-Making NIPS 2024
Efficient Adaptation in Mixed-Motive Environments via Hierarchical Opponent Modeling and Planning ICML 2024
An Embodied Generalist Agent in 3D World ICML 2024
A Minimalist Dataset for Systematic Generalization of Perception, Syntax, and Semantics ICLR 2023
Learning non-Markovian Decision-Making from State-only Sequences NIPS 2023
Evaluating and Inducing Personality in Pre-trained Language Models NIPS 2023
Learning Energy-Based Prior Model with Diffusion-Amortized MCMC NIPS 2023
Chameleon: Plug-and-Play Compositional Reasoning with Large Language Models NIPS 2023
Diplomat: A Dialogue Dataset for Situated PragMATic Reasoning NIPS 2023
On the Complexity of Bayesian Generalization ICML 2023
SQA3D: Situated Question Answering in 3D Scenes ICLR 2023
Dynamic Prompt Learning via Policy Gradient for Semi-structured Mathematical Reasoning ICLR 2023
ARNOLD: A Benchmark for Language-Grounded Task Learning with Continuous States in Realistic 3D Scenes ICCV 2023
X-VoE: Measuring eXplanatory Violation of Expectation in Physical Events ICCV 2023
Diffusion-Based Generation, Optimization, and Planning in 3D Scenes CVPR 2023
Learning from the Tangram to Solve Mini Visual Tasks AAAI 2022
Learning V1 Simple Cells with Vector Representation of Local Content and Matrix Representation of Local Motion AAAI 2022
ValueNet: A New Dataset for Human Value Driven Dialogue System AAAI 2022
MATE: Benchmarking Multi-Agent Reinforcement Learning in Distributed Target Coverage Control NIPS 2022
Emergent Graphical Conventions in a Visual Communication Game NIPS 2022
Towards Human-Level Bimanual Dexterous Manipulation with Reinforcement Learning NIPS 2022
EgoTaskQA: Understanding Human Tasks in Egocentric Videos NIPS 2022
Learn to Explain: Multimodal Reasoning via Thought Chains for Science Question Answering NIPS 2022
Latent Diffusion Energy-Based Model for Interpretable Text Modelling ICML 2022
COAT: Measuring Object Compositionality in Emergent Representations ICML 2022
Learning Algebraic Representation for Systematic Generalization in Abstract Reasoning ECCV 2022
RelViT: Concept-guided Vision Transformer for Visual Relational Reasoning ICLR 2022
MCMC Should Mix: Learning Energy-Based Model with Neural Transport Latent Space MCMC ICLR 2022
Learning Probabilistic Models from Generator Latent Spaces with Hat EBM NIPS 2022
GRICE: A Grammar-based Dataset for Recovering Implicature and Conversational rEasoning IJCNLP 2021
Inter-GPS: Interpretable Geometry Problem Solving with Formal Language and Symbolic Reasoning IJCNLP 2021
SocAoG: Incremental Graph Parsing for Social Relation Inference in Dialogues IJCNLP 2021
Generative PointNet: Deep Energy-Based Learning on Unordered Point Sets for 3D Generation, Reconstruction and Classification CVPR 2021
ACRE: Abstract Causal REasoning Beyond Covariation CVPR 2021
Mind the Context: The Impact of Contextualization in Neural Module Networks for Grounding Visual Referring Expressions EMNLP 2021
Learning Neural Representation of Camera Pose with Matrix Representation of Pose Shift via View Synthesis CVPR 2021
Learning by Fixing: Solving Math Word Problems with Weak Supervision AAAI 2021
Learning Cycle-Consistent Cooperative Networks via Alternating MCMC Teaching for Unsupervised Cross-Domain Translation AAAI 2021
Stochastic Security: Adversarial Defense Using Long-Run Dynamics of Energy-Based Models ICLR 2021
Spatio-Temporal Self-Supervised Representation Learning for 3D Point Clouds ICCV 2021
YouRefIt: Embodied Reference Understanding With Language and Gesture ICCV 2021
VLGrammar: Grounded Grammar Induction of Vision and Language ICCV 2021
Unsupervised Foreground Extraction via Deep Region Competition NIPS 2021
On Path Integration of Grid Cells: Group Representation and Isotropic Scaling NIPS 2021
Iterative Teacher-Aware Learning NIPS 2021
SocAoG: Incremental Graph Parsing for Social Relation Inference in Dialogues ACL 2021
Inter-GPS: Interpretable Geometry Problem Solving with Formal Language and Symbolic Reasoning ACL 2021
GRICE: A Grammar-based Dataset for Recovering Implicature and Conversational rEasoning ACL 2021
Robust Visual Reasoning via Language Guided Neural Module Networks NIPS 2021
SMART: A Situation Model for Algebra Story Problems via Attributed Grammar AAAI 2021
CrossVQA: Scalably Generating Benchmarks for Systematically Testing VQA Generalization EMNLP 2021
Learning Triadic Belief Dynamics in Nonverbal Communication From Videos CVPR 2021
Abstract Spatial-Temporal Reasoning via Probabilistic Abduction and Execution CVPR 2021
CoCoX: Generating Conceptual and Counterfactual Explanations via Fault-Lines AAAI 2020
On the Anatomy of MCMC-Based Maximum Likelihood Learning of Energy-Based Models AAAI 2020
Closed Loop Neural-Symbolic Learning via Integrating Neural Perception, Grammar Parsing, and Symbolic Reasoning ICML 2020
Machine Number Sense: A Dataset of Visual Arithmetic Problems for Abstract and Relational Reasoning AAAI 2020
Structured Attention for Unsupervised Dialogue Structure Induction EMNLP 2020
Motion-Based Generator Model: Unsupervised Disentanglement of Appearance, Trackable and Intrackable Motions in Dynamic Patterns AAAI 2020
LEMMA: A Multi-view Dataset for LEarning Multi-agent Multi-task Activities ECCV 2020
Learning Multi-layer Latent Variable Model via Variational Optimization of Short Run MCMC for Approximate Inference ECCV 2020
A Competence-aware Curriculum for Visual Concepts Learning via Question Answering ECCV 2020
Theory-Based Causal Transfer: Integrating Instance-Level Induction and Abstract-Level Structure Learning AAAI 2020
Words Aren’t Enough, Their Order Matters: On the Robustness of Grounding Visual Referring Expressions ACL 2020
Joint Training of Variational Auto-Encoder and Latent Energy-Based Model CVPR 2020
Learning Latent Space Energy-Based Prior Model NIPS 2020
Inducing Hierarchical Compositional Model by Sparsifying Generator Network CVPR 2020
DenseRaC: Joint 3D Pose and Shape Estimation by Dense Render-and-Compare ICCV 2019
Learning Non-Convergent Non-Persistent Short-Run MCMC Toward Energy-Based Model NIPS 2019
Learning Perceptual Inference by Contrasting NIPS 2019
PerspectiveNet: 3D Object Detection from a Single RGB Image via Perspective Points NIPS 2019
MetaStyle: Three-Way Trade-off among Speed, Flexibility, and Quality in Neural Style Transfer AAAI 2019
Learning Dynamic Generator Model by Alternating Back-Propagation through Time AAAI 2019
Mirroring without Overimitation: Learning Functionally Equivalent Manipulation Actions AAAI 2019
Recognizing Unseen Attribute-Object Pair with Generative Model AAAI 2019
RAVEN: A Dataset for Relational and Analogical Visual REasoNing CVPR 2019
Reasoning Visual Dialogs With Structural and Partial Observations CVPR 2019
Divergence Triangle for Joint Training of Generator Model, Energy-Based Model, and Inferential Model CVPR 2019
Unsupervised Disentangling of Appearance and Geometry by Deformable Generator Network CVPR 2019
Understanding Human Gaze Communication by Spatio-Temporal Graph Reasoning ICCV 2019
Holistic++ Scene Understanding: Single-View 3D Holistic Scene Parsing and Human Pose Estimation With Human-Object Interaction and Physical Commonsense ICCV 2019
Learning Grid Cells as Vector Representation of Self-Position Coupled with Matrix Representation of Self-Motion ICLR 2019
Inferring Shared Attention in Social Scene Videos CVPR 2018
Learning Descriptor Networks for 3D Shape Synthesis and Analysis CVPR 2018
Cooperative Holistic Scene Understanding: Unifying 3D Object, Layout, and Camera Pose Estimation NIPS 2018
Learning Human-Object Interactions by Graph Parsing Neural Networks ECCV 2018
Holistic 3D Scene Parsing and Reconstruction from a Single RGB Image ECCV 2018
Learning Generative ConvNets via Multi-Grid Modeling and Sampling CVPR 2018
Interpretable Convolutional Neural Networks CVPR 2018
Where and Why Are They Looking? Jointly Inferring Human Attention and Intentions in Complex Tasks CVPR 2018
Generalized Earley Parser: Bridging Symbolic Grammars and Sequence Data for Future Prediction ICML 2018
A Causal And-Or Graph Model for Visibility Fluent Reasoning in Tracking Interacting Objects CVPR 2018
Attentive Fashion Grammar Network for Fashion Landmark Detection and Clothing Category Classification CVPR 2018
Human-Centric Indoor Scene Synthesis Using Stochastic Grammar CVPR 2018
Mining Object Parts From CNNs via Active Question-Answering CVPR 2017
Single-Image 3D Scene Parsing Using Geometric Commonsense IJCAI 2017
Inferring Human Attention by Learning Latent Intentions IJCAI 2017
Predicting Human Activities Using Stochastic Grammar ICCV 2017
Jointly Recognizing Object Fluents and Tasks in Egocentric Videos ICCV 2017
Monocular 3D Human Pose Estimation by Predicting Depth on Joints ICCV 2017
Learning Human Utility from Video Demonstrations for Deductive Planning in Robotics CORL 2017
Generative Hierarchical Learning of Sparse FRAME Models CVPR 2017
Synthesizing Dynamic Patterns by Spatial-Temporal Generative ConvNet CVPR 2017
CERN: Confidence-Energy Recurrent Network for Group Activity Recognition CVPR 2017
What Is Where: Inferring Containment Relations from Videos IJCAI 2016
Learning Social Affordance for Human-Robot Interaction IJCAI 2016
A Theory of Generative ConvNet ICML 2016
Inferring Forces and Learning Human Utilities From Videos CVPR 2016
Multi-View People Tracking via Hierarchical Trajectory Composition CVPR 2016
Recognizing Car Fluents From Video CVPR 2016
Grounded Semantic Role Labeling NAACL 2016
Jointly Learning Grounded Task Structures from Language Instruction and Visual Demonstration EMNLP 2016
Mining And-Or Graphs for Graph Matching and Object Discovery ICCV 2015
Attributed Grammars for Joint Estimation of Human Attributes, Part and Pose ICCV 2015
Automated Facial Trait Judgment and Election Outcome Prediction: Social Dimensions of Face ICCV 2015
Joint Action Recognition and Pose Estimation From Video CVPR 2015
Learning Inhomogeneous FRAME Models for Object Patterns CVPR 2014
Single-View 3D Scene Parsing by Attributed Grammar CVPR 2014
Visual Persuasion: Inferring Communicative Intents of Images CVPR 2014
Cross-view Action Modeling, Learning and Recognition CVPR 2014
Unsupervised Learning of Dictionaries of Hierarchical Compositional Models CVPR 2014
Modeling 4D Human-Object Interactions for Event and Object Recognition ICCV 2013
Beyond Point Clouds: Scene Understanding by Reasoning Geometry and Physics CVPR 2013
Scene Parsing by Integrating Function, Geometry and Appearance Models CVPR 2013
Integrating Grammar and Segmentation for Human Pose Estimation CVPR 2013
Concurrent Action Detection with Structural Prediction ICCV 2013
Discriminatively Trained And-Or Tree Models for Object Detection CVPR 2013
Weakly Supervised Learning for Attribute Localization in Outdoor Scenes CVPR 2013
Inferring "Dark Matter" and "Dark Energy" from Videos ICCV 2013
Cosegmentation and Cosketch by Unsupervised Learning ICCV 2013
Unsupervised Structure Learning of Stochastic And-Or Grammars NIPS 2013
Monte Carlo Tree Search for Scheduling Activity Recognition ICCV 2013
Modeling Occlusion by Discriminative AND-OR Structures ICCV 2013
Learning Near-Optimal Cost-Sensitive Decision Policy for Object Detection ICCV 2013
Human Attribute Recognition by Rich Appearance Dictionary ICCV 2013
Image Parsing with Stochastic Scene Grammar NIPS 2011