Xizhou Zhu
48 papers · 2017–2025 · 6 conferences · across top CS/AI conferences
Achievements
🏃 Academic Marathon (8)
🌉 Interdisciplinary Bridge
🌍 Conference Polyglot (6)
🧭 Keyword Pioneer
🐝 Cross-Pollinator (12)
🌈 Renaissance Researcher (6)
🏠 Conference Loyalist (20)
🤝 Dynamic Duo (46)
👥 Mega-Team (38)
🔬 Deep Specialist (11)
🧬 Topic Evolution
🏆 Keyword Champion (4)
🗃️ Keyword Collector (157)
📈 Trend Setter
🔥 Unstoppable (9)
💎 Century Club (48)
⚡ Prolific Year (12)
Conferences
CVPR (20)
ICLR (9)
NIPS (8)
ECCV (5)
ICCV (5)
ICML (1)
Keywords
object detection (7)
vision-language model (7)
large language model (5)
visual representation (4)
multimodal large language model (4)
multimodal learning (4)
deformable convolution (4)
semantic segmentation (3)
multi-task learning (3)
transfer learning (3)
image generation (3)
foundation model (3)
self-supervised learning (3)
convolutional neural network (3)
multi-modal learning (2)
zero-shot learning (2)
autonomous driving (2)
neural network optimization (2)
visual question answering (2)
representation learning (2)
Papers
Mono-InternVL: Pushing the Boundaries of Monolithic Multimodal Large Language Models with Endogenous Visual Pre-training
CVPR 2025
Dita: Scaling Diffusion Transformer for Generalist Vision-Language-Action Policy
ICCV 2025
LangBridge: Interpreting Image as a Combination of Language Embeddings
ICCV 2025
CoMemo: LVLMs Need Image Context with Image Memory
ICML 2025
PVC: Progressive Visual Token Compression for Unified Image and Video Processing in Large Vision-Language Models
CVPR 2025
SynerGen-VL: Towards Synergistic Image Understanding and Generation with Vision Experts and Token Folding
CVPR 2025
V2PE: Improving Multimodal Long-Context Capability of Vision-Language Models with Variable Visual Position Encoding
ICCV 2025
MMIU: Multimodal Multi-image Understanding for Evaluating Large Vision-Language Models
ICLR 2025
Vision-RWKV: Efficient and Scalable Visual Perception with RWKV-Like Architectures
ICLR 2025
OmniCorpus: A Unified Multimodal Corpus of 10 Billion-Level Images Interleaved with Text
ICLR 2025
HoVLE: Unleashing the Power of Monolithic Vision-Language Models with Holistic Vision-Language Embedding
CVPR 2025
Auto MC-Reward: Automated Dense Reward Design with Large Language Models for Minecraft
CVPR 2024
Efficient Deformable ConvNets: Rethinking Dynamic and Sparse Operator for Vision Applications
CVPR 2024
The All-Seeing Project V2: Towards General Relation Comprehension of the Open World
ECCV 2024
ControlLLM: Augment Language Models with Tools by Searching on Graphs
ECCV 2024
Parameter-Inverted Image Pyramid Networks
NIPS 2024
ADDP: Learning General Representations for Image Recognition and Generation with Alternating Denoising Diffusion Process
ICLR 2024
The All-Seeing Project: Towards Panoptic Visual Recognition and Understanding of the Open World
ICLR 2024
Needle In A Multimodal Haystack
NIPS 2024
Vision Model Pre-training on Interleaved Image-Text Data via Latent Compression Learning
NIPS 2024
Learning 1D Causal Visual Representation with De-focus Attention Networks
NIPS 2024
VisionLLM v2: An End-to-End Generalist Multimodal Large Language Model for Hundreds of Vision-Language Tasks
NIPS 2024
InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks
CVPR 2024
Siamese Image Modeling for Self-Supervised Vision Representation Learning
CVPR 2023
VisionLLM: Large Language Model is also an Open-Ended Decoder for Vision-Centric Tasks
NIPS 2023
Towards All-in-One Pre-Training via Maximizing Multi-Modal Mutual Information
CVPR 2023
Uni-Perceiver v2: A Generalist Model for Large-Scale Vision and Vision-Language Tasks
CVPR 2023
BEVFormer v2: Adapting Modern Image Backbones to Bird's-Eye-View Recognition via Perspective Supervision
CVPR 2023
InternImage: Exploring Large-Scale Vision Foundation Models With Deformable Convolutions
CVPR 2023
Planning-Oriented Autonomous Driving
CVPR 2023
VL-LTR: Learning Class-Wise Visual-Linguistic Representation for Long-Tailed Visual Recognition
ECCV 2022
DeciWatch: A Simple Baseline for 10× Efficient 2D and 3D Pose Estimation
ECCV 2022
Exploring the Equivalence of Siamese Self-Supervised Learning via a Unified Gradient Framework
CVPR 2022
AutoLoss-Zero: Searching Loss Functions From Scratch for Generic Tasks
CVPR 2022
Uni-Perceiver: Pre-Training Unified Architecture for Generic Perception for Zero-Shot and Few-Shot Tasks
CVPR 2022
Uni-Perceiver-MoE: Learning Sparse Generalist Models with Conditional MoEs
NIPS 2022
Auto Seg-Loss: Searching Metric Surrogates for Semantic Segmentation
ICLR 2021
Deformable DETR: Deformable Transformers for End-to-End Object Detection
ICLR 2021
Searching Parameterized AP Loss for Object Detection
NIPS 2021
Unsupervised Object Detection With LIDAR Clues
CVPR 2021
VL-BERT: Pre-training of Generic Visual-Linguistic Representations
ICLR 2020
Deformable Kernels: Adapting Effective Receptive Fields for Object Deformation
ICLR 2020
Spatially Adaptive Inference with Stochastic Feature Sampling and Interpolation
ECCV 2020
Deformable ConvNets V2: More Deformable, Better Results
CVPR 2019
An Empirical Study of Spatial Attention Mechanisms in Deep Networks
ICCV 2019
Towards High Performance Video Object Detection
CVPR 2018
Flow-Guided Feature Aggregation for Video Object Detection
ICCV 2017
Deep Feature Flow for Video Recognition
CVPR 2017