Xiaojian Ma

31 papers · 2019–2026 · 9 top CS/AI conferences

Achievements

🧭 Keyword Pioneer 🌍 Conference Polyglot (9) 🗺️ Taxonomy Completionist (10) 🌉 Interdisciplinary Bridge 🏃 Academic Marathon (6)

🐝 Cross-Pollinator (13) 🏆 Grand Slam 👑 Triple Crown 🤝 Dynamic Duo (13) 🧬 Topic Evolution 🔥 Unstoppable (7) 📈 Trend Setter 🚀 Conference Pioneer ⚡ Prolific Year (11) 🗃️ Keyword Collector (128) 💎 Century Club (30)

Conferences

ICLR (7) NIPS (5) AAAI (4) CVPR (4) ICML (4) ICCV (3) ECCV (2) ACL (1) NAACL (1)

Top co-authors

Qing Li (14) Song-Chun Zhu (10) Yitao Liang (7) Siyuan Huang (6) Zihao Wang (6) Anji Liu (5) Baoxiong Jia (5) Shaofei Cai (5) Wenbing Huang (4) Rujie Wu (4)

Keywords

imitation learning (4) instruction following (2) embodied ai (2) large language model (2) scene understanding (2) visual grounding (2) vision-language model (2) energy-based model (2) latent diffusion (2) reinforcement learning (2) foundation model (2) bayesian learning (1) image segmentation (1) transfer learning (1) variational inference (1) policy optimization (1) continual learning (1) generative modeling (1) visual question answering (1) autoregressive transformer (1)

Papers

TongUI: Internet-Scale Trajectories from Multimodal Web Tutorials for Generalized GUI Agents AAAI 2026
ROCKET-1: Mastering Open-World Interaction with Visual-Temporal Context Prompting CVPR 2025
JARVIS-VLA: Post-Training Large-Scale Vision Language Models to Play Visual Games with Keyboards and Mouse ACL 2025
GROOT-2: Weakly Supervised Multimodal Instruction Following Agents ICLR 2025
Falcon: Fast Visuomotor Policies via Partial Denoising ICML 2025
Embodied VideoAgent: Persistent Memory from Egocentric Videos and Embodied Sensors Enables Dynamic Scene Understanding ICCV 2025
Move to Understand a 3D Scene: Bridging Visual Grounding and Exploration for Efficient and Versatile Embodied Navigation ICCV 2025
Multi-modal Agent Tuning: Building a VLM-Driven Agent for Efficient Tool Usage ICLR 2025
VideoAgent: A Memory-augmented Multimodal Agent for Video Understanding ECCV 2024
Unifying 3D Vision-Language Understanding via Promptable Queries ECCV 2024
UltraEdit: Instruction-based Fine-Grained Image Editing at Scale NIPS 2024
GROOT: Learning to Follow Instructions by Watching Gameplay Videos ICLR 2024
Bongard-OpenWorld: Few-Shot Reasoning for Free-form Visual Concepts in the Real World ICLR 2024
MMICL: Empowering Vision-language Model with Multi-Modal In-Context Learning ICLR 2024
OmniJARVIS: Unified Vision-Language-Action Tokenization Enables Open-World Instruction Following Agents NIPS 2024
Multi-modal Situated Reasoning in 3D Scenes NIPS 2024
MindAgent: Emergent Gaming Interaction NAACL 2024
An Embodied Generalist Agent in 3D World ICML 2024
CLOVA: A Closed-LOop Visual Assistant with Tool Usage and Update CVPR 2024
3D-VisTA: Pre-trained Transformer for 3D Vision and Text Alignment ICCV 2023
Open-World Multi-Task Control Through Goal-Aware Representation Learning and Adaptive Horizon Prediction CVPR 2023
SQA3D: Situated Question Answering in 3D Scenes ICLR 2023
RelViT: Concept-guided Vision Transformer for Visual Relational Reasoning ICLR 2022
Bongard-HOI: Benchmarking Few-Shot Visual Reasoning for Human-Object Interactions CVPR 2022
Latent Diffusion Energy-Based Model for Interpretable Text Modelling ICML 2022
Adversarial Option-Aware Hierarchical Imitation Learning ICML 2021
Unsupervised Foreground Extraction via Deep Region Competition NIPS 2021
Reinforcement Learning from Imperfect Demonstrations under Soft Expert Guidance AAAI 2020
Theory-Based Causal Transfer: Integrating Instance-Level Induction and Abstract-Level Structure Learning AAAI 2020
Imitation Learning from Observations by Minimizing Inverse Dynamics Disagreement NIPS 2019
Task Transfer by Preference-Based Cost Learning AAAI 2019