Xizhou Zhu

48 papers · 2017–2025 · 6 top CS/AI conferences

Achievements

🏃 Academic Marathon (8) 🌉 Interdisciplinary Bridge 🌍 Conference Polyglot (6) 🧭 Keyword Pioneer 🐝 Cross-Pollinator (12)

🌈 Renaissance Researcher (6) 🏠 Conference Loyalist (20) 🤝 Dynamic Duo (46) 👥 Mega-Team (38) 🔬 Deep Specialist (11) 🧬 Topic Evolution 🏆 Keyword Champion (4) 🗃️ Keyword Collector (157) 📈 Trend Setter 🔥 Unstoppable (9) 💎 Century Club (48) ⚡ Prolific Year (12)

Conferences

CVPR (20) ICLR (9) NIPS (8) ECCV (5) ICCV (5) ICML (1)

Top co-authors

Jifeng Dai (46) Yu Qiao (25) Lewei Lu (22) Wenhai Wang (19) Gao Huang (11) Hongsheng Li (11) Zhe Chen (11) Xiaogang Wang (10) Hao Li (10) Jie Zhou (8)

Keywords

object detection (7) vision-language model (7) large language model (5) visual representation (4) multimodal large language model (4) multimodal learning (4) deformable convolution (4) semantic segmentation (3) multi-task learning (3) transfer learning (3) image generation (3) foundation model (3) self-supervised learning (3) convolutional neural network (3) multi-modal learning (2) zero-shot learning (2) autonomous driving (2) neural network optimization (2) visual question answering (2) representation learning (2)

Papers

Mono-InternVL: Pushing the Boundaries of Monolithic Multimodal Large Language Models with Endogenous Visual Pre-training CVPR 2025
Dita: Scaling Diffusion Transformer for Generalist Vision-Language-Action Policy ICCV 2025
LangBridge: Interpreting Image as a Combination of Language Embeddings ICCV 2025
CoMemo: LVLMs Need Image Context with Image Memory ICML 2025
PVC: Progressive Visual Token Compression for Unified Image and Video Processing in Large Vision-Language Models CVPR 2025
SynerGen-VL: Towards Synergistic Image Understanding and Generation with Vision Experts and Token Folding CVPR 2025
V2PE: Improving Multimodal Long-Context Capability of Vision-Language Models with Variable Visual Position Encoding ICCV 2025
MMIU: Multimodal Multi-image Understanding for Evaluating Large Vision-Language Models ICLR 2025
Vision-RWKV: Efficient and Scalable Visual Perception with RWKV-Like Architectures ICLR 2025
OmniCorpus: A Unified Multimodal Corpus of 10 Billion-Level Images Interleaved with Text ICLR 2025
HoVLE: Unleashing the Power of Monolithic Vision-Language Models with Holistic Vision-Language Embedding CVPR 2025
Auto MC-Reward: Automated Dense Reward Design with Large Language Models for Minecraft CVPR 2024
Efficient Deformable ConvNets: Rethinking Dynamic and Sparse Operator for Vision Applications CVPR 2024
The All-Seeing Project V2: Towards General Relation Comprehension of the Open World ECCV 2024
ControlLLM: Augment Language Models with Tools by Searching on Graphs ECCV 2024
Parameter-Inverted Image Pyramid Networks NIPS 2024
ADDP: Learning General Representations for Image Recognition and Generation with Alternating Denoising Diffusion Process ICLR 2024
The All-Seeing Project: Towards Panoptic Visual Recognition and Understanding of the Open World ICLR 2024
Needle In A Multimodal Haystack NIPS 2024
Vision Model Pre-training on Interleaved Image-Text Data via Latent Compression Learning NIPS 2024
Learning 1D Causal Visual Representation with De-focus Attention Networks NIPS 2024
VisionLLM v2: An End-to-End Generalist Multimodal Large Language Model for Hundreds of Vision-Language Tasks NIPS 2024
InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks CVPR 2024
Siamese Image Modeling for Self-Supervised Vision Representation Learning CVPR 2023
VisionLLM: Large Language Model is also an Open-Ended Decoder for Vision-Centric Tasks NIPS 2023
Towards All-in-One Pre-Training via Maximizing Multi-Modal Mutual Information CVPR 2023
Uni-Perceiver v2: A Generalist Model for Large-Scale Vision and Vision-Language Tasks CVPR 2023
BEVFormer v2: Adapting Modern Image Backbones to Bird's-Eye-View Recognition via Perspective Supervision CVPR 2023
InternImage: Exploring Large-Scale Vision Foundation Models With Deformable Convolutions CVPR 2023
Planning-Oriented Autonomous Driving CVPR 2023
VL-LTR: Learning Class-Wise Visual-Linguistic Representation for Long-Tailed Visual Recognition ECCV 2022
DeciWatch: A Simple Baseline for 10× Efficient 2D and 3D Pose Estimation ECCV 2022
Exploring the Equivalence of Siamese Self-Supervised Learning via a Unified Gradient Framework CVPR 2022
AutoLoss-Zero: Searching Loss Functions From Scratch for Generic Tasks CVPR 2022
Uni-Perceiver: Pre-Training Unified Architecture for Generic Perception for Zero-Shot and Few-Shot Tasks CVPR 2022
Uni-Perceiver-MoE: Learning Sparse Generalist Models with Conditional MoEs NIPS 2022
Auto Seg-Loss: Searching Metric Surrogates for Semantic Segmentation ICLR 2021
Deformable DETR: Deformable Transformers for End-to-End Object Detection ICLR 2021
Searching Parameterized AP Loss for Object Detection NIPS 2021
Unsupervised Object Detection With LIDAR Clues CVPR 2021
VL-BERT: Pre-training of Generic Visual-Linguistic Representations ICLR 2020
Deformable Kernels: Adapting Effective Receptive Fields for Object Deformation ICLR 2020
Spatially Adaptive Inference with Stochastic Feature Sampling and Interpolation ECCV 2020
Deformable ConvNets V2: More Deformable, Better Results CVPR 2019
An Empirical Study of Spatial Attention Mechanisms in Deep Networks ICCV 2019
Towards High Performance Video Object Detection CVPR 2018
Flow-Guided Feature Aggregation for Video Object Detection ICCV 2017
Deep Feature Flow for Video Recognition CVPR 2017