Xiaojian Ma
31 papers · 2019–2026 · 9 top CS/AI conferences
Achievements
Keyword Pioneer
Conference Polyglot (9)
Taxonomy Completionist (10)
Interdisciplinary Bridge
Academic Marathon (6)
Cross-Pollinator (13)
Grand Slam
Triple Crown
Dynamic Duo (13)
Topic Evolution
Unstoppable (7)
Trend Setter
Conference Pioneer
Prolific Year (11)
Keyword Collector (128)
Century Club (30)
Conferences
ICLR (7)
NIPS (5)
AAAI (4)
CVPR (4)
ICML (4)
ICCV (3)
ECCV (2)
ACL (1)
NAACL (1)
Keywords
imitation learning (4)
instruction following (2)
embodied ai (2)
large language model (2)
scene understanding (2)
visual grounding (2)
vision-language model (2)
energy-based model (2)
latent diffusion (2)
reinforcement learning (2)
foundation model (2)
bayesian learning (1)
image segmentation (1)
transfer learning (1)
variational inference (1)
policy optimization (1)
continual learning (1)
generative modeling (1)
visual question answering (1)
autoregressive transformer (1)
Papers
TongUI: Internet-Scale Trajectories from Multimodal Web Tutorials for Generalized GUI Agents
AAAI 2026
ROCKET-1: Mastering Open-World Interaction with Visual-Temporal Context Prompting
CVPR 2025
JARVIS-VLA: Post-Training Large-Scale Vision Language Models to Play Visual Games with Keyboards and Mouse
ACL 2025
GROOT-2: Weakly Supervised Multimodal Instruction Following Agents
ICLR 2025
Falcon: Fast Visuomotor Policies via Partial Denoising
ICML 2025
Embodied VideoAgent: Persistent Memory from Egocentric Videos and Embodied Sensors Enables Dynamic Scene Understanding
ICCV 2025
Move to Understand a 3D Scene: Bridging Visual Grounding and Exploration for Efficient and Versatile Embodied Navigation
ICCV 2025
Multi-modal Agent Tuning: Building a VLM-Driven Agent for Efficient Tool Usage
ICLR 2025
VideoAgent: A Memory-augmented Multimodal Agent for Video Understanding
ECCV 2024
Unifying 3D Vision-Language Understanding via Promptable Queries
ECCV 2024
UltraEdit: Instruction-based Fine-Grained Image Editing at Scale
NIPS 2024
GROOT: Learning to Follow Instructions by Watching Gameplay Videos
ICLR 2024
Bongard-OpenWorld: Few-Shot Reasoning for Free-form Visual Concepts in the Real World
ICLR 2024
MMICL: Empowering Vision-language Model with Multi-Modal In-Context Learning
ICLR 2024
OmniJARVIS: Unified Vision-Language-Action Tokenization Enables Open-World Instruction Following Agents
NIPS 2024
Multi-modal Situated Reasoning in 3D Scenes
NIPS 2024
MindAgent: Emergent Gaming Interaction
NAACL 2024
An Embodied Generalist Agent in 3D World
ICML 2024
CLOVA: A Closed-LOop Visual Assistant with Tool Usage and Update
CVPR 2024
3D-VisTA: Pre-trained Transformer for 3D Vision and Text Alignment
ICCV 2023
Open-World Multi-Task Control Through Goal-Aware Representation Learning and Adaptive Horizon Prediction
CVPR 2023
SQA3D: Situated Question Answering in 3D Scenes
ICLR 2023
RelViT: Concept-guided Vision Transformer for Visual Relational Reasoning
ICLR 2022
Bongard-HOI: Benchmarking Few-Shot Visual Reasoning for Human-Object Interactions
CVPR 2022
Latent Diffusion Energy-Based Model for Interpretable Text Modelling
ICML 2022
Adversarial Option-Aware Hierarchical Imitation Learning
ICML 2021
Unsupervised Foreground Extraction via Deep Region Competition
NIPS 2021
Reinforcement Learning from Imperfect Demonstrations under Soft Expert Guidance
AAAI 2020
Theory-Based Causal Transfer: Integrating Instance-Level Induction and Abstract-Level Structure Learning
AAAI 2020
Imitation Learning from Observations by Minimizing Inverse Dynamics Disagreement
NIPS 2019
Task Transfer by Preference-Based Cost Learning
AAAI 2019