Xinlei Chen

55 papers · 2013–2026 · 12 conferences · across top CS/AI conferences

Achievements

🌉 Interdisciplinary Bridge 🌈 Renaissance Researcher (11) 🏃 Academic Marathon (12) 🌍 Conference Polyglot (11) 🗺️ Taxonomy Completionist (91) 🧭 Keyword Pioneer 🌟 Keyword Trendsetter Combo (5) 🔬 Deep Specialist (12) 🏆 Grand Slam 🧬 Topic Evolution ⚡ Prolific Year (5) ❓ The Questioner 🔥 Unstoppable (13) 🗃️ Keyword Collector (219) 💎 Century Club (53) 🚀 Conference Pioneer 📈 Trend Setter

Conferences

CVPR (18) ICCV (13) ICLR (5) ICML (5) ACL (4) NIPS (3) AAAI (2) ECCV (1) EMNLP (1) IJCAI (1) JMLR (1) NAACL (1)

Top co-authors

Kaiming He (8) Devi Parikh (7) Marcus Rohrbach (7) Saining Xie (7) Abhinav Gupta (7) Dhruv Batra (6) Chen Gao (6) Yuandong Tian (5) Zhuang Liu (5) Jirong Zha (4)

Research topics

Computer Vision (1) Statistics (1)

Keywords

self-supervised learning (9) multimodal learning (8) representation learning (6) visual question answering (6) object detection (5) contrastive learning (4) large language model (4) convolutional neural network (4) vision transformer (3) visual representation (3) knowledge distillation (3) image captioning (3) masked autoencoder (3) transfer learning (3) video understanding (2) visual grounding (2) point cloud (2) domain adaptation (2) transformer architecture (2) computer vision (2)

Papers

DIMM: Decoupled Multi-hierarchy Kalman Filter via Reinforcement Learning AAAI 2026
AirCopBench: A Benchmark for Multi-drone Collaborative Embodied Perception and Reasoning AAAI 2026
Analyzing and Modeling LLM Response Lengths with Extreme Value Theory: Anchoring Effects and Hybrid Distributions EMNLP 2025
Context-Aware Sentiment Forecasting via LLM-based Multi-Perspective Role-Playing Agents ACL 2025
CityNavAgent: Aerial Vision-and-Language Navigation with Hierarchical Semantic Planning and Global Memory ACL 2025
UrbanVideo-Bench: Benchmarking Vision-Language Models on Embodied Intelligence with Video Data in Urban Spaces ACL 2025
Test-Time Training on Video Streams JMLR 2025
How to Enable LLM with 3D Capacity? A Survey of Spatial Reasoning in LLM IJCAI 2025
Learning to (Learn at Test Time): RNNs with Expressive Hidden States ICML 2025
Learnings from Scaling Visual Tokenizers for Reconstruction and Generation ICML 2025
Highly Compressed Tokenizer Can Generate Without Training ICML 2025
LLMs can see and hear without any training ICML 2025
An Image is Worth More Than 16x16 Patches: Exploring Transformers on Individual Pixels ICLR 2025
Deconstructing Denoising Diffusion Models for Self-Supervised Learning ICLR 2025
Transformers without Normalization CVPR 2025
PRE-Mamba: A 4D State Space Model for Ultra-High-Frequent Event Camera Deraining ICCV 2025
Scaling Language-Free Visual Representation Learning ICCV 2025
MetaMorph: Multimodal Understanding and Generation via Instruction Tuning ICCV 2025
On the Surprising Effectiveness of Attention Transfer for Vision Transformers NIPS 2024
Scaling Proprioceptive-Visual Learning with Heterogeneous Pre-trained Transformers NIPS 2024
R-MAE: Regions Meet Masked Autoencoders ICLR 2024
Improving Selective Visual Question Answering by Learning From Your Peers CVPR 2023
UniT3D: A Unified Transformer for 3D Dense Captioning and Visual Grounding ICCV 2023
ConvNeXt V2: Co-Designing and Scaling ConvNets With Masked Autoencoders CVPR 2023
Test-Time Training with Masked Autoencoders NIPS 2022
Masked Autoencoders Are Scalable Vision Learners CVPR 2022
On the Importance of Asymmetry for Siamese Representation Learning CVPR 2022
Point-Level Region Contrast for Object Detection Pre-Training CVPR 2022
NASViT: Neural Architecture Search for Efficient Vision Transformers with Gradient Conflict aware Supernet Training ICLR 2022
Understanding self-supervised learning dynamics without contrastive pairs ICML 2021
KRISP: Integrating Implicit and Symbolic Knowledge for Open-Domain Knowledge-Based VQA CVPR 2021
Exploring Simple Siamese Representation Learning CVPR 2021
MoVie: Revisiting Modulated Convolutions for Visual Counting and Beyond ICLR 2021
An Empirical Study of Training Self-Supervised Vision Transformers ICCV 2021
In Defense of Grid Features for Visual Question Answering CVPR 2020
ImVoteNet: Boosting 3D Object Detection in Point Clouds With Image Votes CVPR 2020
Seeing the Un-Scene: Learning Amodal Semantic Maps for Room Navigation ECCV 2020
Grounded Video Description CVPR 2019
Order-Aware Generative Modeling Using the 3D-Craft Dataset ICCV 2019
Embodied Amodal Recognition: Learning to Move to Perceive Objects ICCV 2019
TensorMask: A Foundation for Dense Object Segmentation ICCV 2019
nocaps: novel object captioning at scale ICCV 2019
Prior-Aware Neural Network for Partially-Supervised Multi-Organ Segmentation ICCV 2019
CoDraw: Collaborative Drawing as a Testbed for Grounded Goal-driven Communication ACL 2019
Towards VQA Models That Can Read CVPR 2019
Multi-Target Embodied Question Answering CVPR 2019
Cycle-Consistency for Robust Visual Question Answering CVPR 2019
Iterative Visual Reasoning Beyond Convolutions CVPR 2018
Spatial Memory for Context Reasoning in Object Detection ICCV 2017
Visualizing and Understanding Neural Models in NLP NAACL 2016
Sense Discovery via Co-Clustering on Images and Text CVPR 2015
Mind's Eye: A Recurrent Visual Representation for Image Caption Generation CVPR 2015
Webly Supervised Learning of Convolutional Networks ICCV 2015
Enriching Visual Knowledge Bases via Object Discovery and Segmentation CVPR 2014
NEIL: Extracting Visual Knowledge from Web Data ICCV 2013