Renrui Zhang

65 papers · 2021–2026 · 12 conferences · across top CS/AI conferences

Achievements

🌍 Conference Polyglot (12) 🏃 Academic Marathon (5) 🧭 Keyword Pioneer 🌉 Interdisciplinary Bridge 🐝 Cross-Pollinator (5)

🌈 Renaissance Researcher (6) 🗺️ Taxonomy Completionist (86) 🔬 Deep Specialist (16) 👥 Mega-Team (22) 🤝 Dynamic Duo (31) 👑 Triple Crown 🏆 Grand Slam ⚡ Prolific Year (5) 🔥 Unstoppable (6) ❓ The Questioner 🗃️ Keyword Collector (223) 💎 Century Club (63)

Conferences

CVPR (17) AAAI (9) ICCV (9) ICLR (9) ECCV (6) NIPS (5) ICML (4) WACV (2) ACL (1) CORL (1) EMNLP (1) IJCAI (1)

Top co-authors

Peng Gao (32) Hongsheng Li (27) Ziyu Guo (18) Yu Qiao (15) Shanghang Zhang (15) Jiaming Liu (12) Aojun Zhou (9) Dongzhi Jiang (7) Yandong Guo (6) Longtian Qiu (5)

Keywords

point cloud (12) multimodal learning (6) masked autoencoder (6) few-shot learning (5) 3D vision (5) self-supervised learning (5) domain adaptation (5) zero-shot learning (5) multi-modal learning (5) foundation model (5) large language model (4) contrastive learning (4) 3D object detection (4) autonomous driving (3) transfer learning (3) robotic manipulation (3) model compression (3) object detection (3) multimodal large language model (3) continual learning (2)

Papers

TIDE: Temporal-Aware Sparse Autoencoders for Interpretable Diffusion Transformers in Image Generation AAAI 2026
NL2CA: Auto-formalizing Cognitive Decision-Making from Natural Language Using an Unsupervised CriticNL2LTL Framework AAAI 2026
PiSA: A Self-Augmented Data Engine and Training Strategy for 3D Understanding with Large Models WACV 2026
3DS-VLA: A 3D Spatial-Aware Vision Language Action Model for Robust Multi-Task Manipulation CORL 2025
Let's Verify and Reinforce Image Generation Step by Step CVPR 2025
Lift3D Policy: Lifting 2D Foundation Models for Robust 3D Robotic Manipulation CVPR 2025
Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis CVPR 2025
MME-CoT: Benchmarking Chain-of-Thought in Large Multimodal Models for Reasoning Quality, Robustness, and Efficiency ICML 2025
MAVIS: Mathematical Visual Instruction Tuning with an Automatic Data Engine ICLR 2025
Lumina-T2X: Scalable Flow-based Large Diffusion Transformer for Flexible Resolution Generation ICLR 2025
PixWizard: Versatile Image-to-Image Visual Assistant with Open-Language Instructions ICLR 2025
LLaVA-Interleave: Tackling Multi-image, Video, and 3D in Large Multimodal Models ICLR 2025
From Reflection to Perfection: Scaling Inference-Time Optimization for Text-to-Image Diffusion Models via Reflection Tuning ICCV 2025
MM-Mixing: Multi-Modal Mixing Alignment for 3D Understanding AAAI 2025
MMSearch: Unveiling the Potential of Large Models as Multi-modal Search Engines ICLR 2025
LiDAR-LLM: Exploring the Potential of Large Language Models for 3D LiDAR Understanding AAAI 2025
Chimera: Improving Generalist Model with Domain-Specific Experts ICCV 2025
TAR3D: Creating High-Quality 3D Assets via Next-Part Prediction ICCV 2025
Detect Anything 3D in the Wild ICCV 2025
SciVerse: Unveiling the Knowledge Comprehension and Visual Reasoning of LMMs on Multi-modal Scientific Problems ACL 2025
Beyond Text-Visual Attention: Exploiting Visual Cues for Effective Token Pruning in VLMs ICCV 2025
Unleashing the Potentials of Likelihood Composition for Multi-modal Language Models EMNLP 2024
SPP: Sparsity-Preserved Parameter-Efficient Fine-Tuning for Large Language Models ICML 2024
FM-OV3D: Foundation Model-Based Cross-Modal Knowledge Blending for Open-Vocabulary 3D Detection AAAI 2024
MathCoder: Seamless Code Integration in LLMs for Enhanced Mathematical Reasoning ICLR 2024
LLaMA-Adapter: Efficient Fine-tuning of Large Language Models with Zero-initialized Attention ICLR 2024
Personalize Segment Anything Model with One Shot ICLR 2024
ViDA: Homeostatic Visual Domain Adapter for Continual Test Time Adaptation ICLR 2024
SPHINX-X: Scaling Data and Parameters for a Family of Multi-modal Large Language Models ICML 2024
CoMat: Aligning Text-to-Image Diffusion Model with Image-to-Text Concept Matching NIPS 2024
RoboMamba: Efficient Vision-Language-Action Model for Robotic Reasoning and Manipulation NIPS 2024
Referred by Multi-Modality: A Unified Temporal Transformer for Video Object Segmentation AAAI 2024
Parsing All Adverse Scenes: Severity-Aware Semantic Segmentation with Mask-Enhanced Cross-Domain Consistency AAAI 2024
Gradient-based Parameter Selection for Efficient Fine-Tuning CVPR 2024
ManipLLM: Embodied Multimodal Large Language Model for Object-Centric Robotic Manipulation CVPR 2024
No Time to Train: Empowering Non-Parametric Networks for Few-shot 3D Scene Segmentation CVPR 2024
Continual-MAE: Adaptive Distribution Masked Autoencoders for Continual Test-Time Adaptation CVPR 2024
NTO3D: Neural Target Object 3D Reconstruction with Segment Anything CVPR 2024
OneTracker: Unifying Visual Object Tracking with Foundation Models and Efficient Tuning CVPR 2024
Cloud-Device Collaborative Learning for Multimodal Large Language Models CVPR 2024
MMT-Bench: A Comprehensive Multimodal Benchmark for Evaluating Large Vision-Language Models Towards Multitask AGI ICML 2024
MathVerse: Does Your Multi-modal LLM Truly See the Diagrams in Visual Math Problems? ECCV 2024
PanoVOS: Bridging Non-panoramic and Panoramic Views with Transformer for Video Segmentation ECCV 2024
SPHINX: A Mixer of Weights, Visual Embeddings and Image Scales for Multi-modal Large Language Models ECCV 2024
Not All Features Matter: Enhancing Few-shot CLIP with Adaptive Prior Refinement ICCV 2023
JourneyDB: A Benchmark for Generative Image Understanding NIPS 2023
CALIP: Zero-Shot Enhancement of CLIP with Parameter-Free Attention AAAI 2023
Decorate the Newcomers: Visual Domain Prompt for Continual Test Time Adaptation AAAI 2023
Prompt, Generate, Then Cache: Cascade of Foundation Models Makes Strong Few-Shot Learners CVPR 2023
Starting From Non-Parametric Networks for 3D Point Cloud Analysis CVPR 2023
Learning 3D Representations From 2D Pre-Trained Models via Image-to-Point Masked Autoencoders CVPR 2023
iQuery: Instruments As Queries for Audio-Visual Sound Separation CVPR 2023
EDA: Explicit Text-Decoupling and Dense Alignment for 3D Visual Grounding CVPR 2023
PiMAE: Point Cloud and Image Interactive Masked Autoencoders for 3D Object Detection CVPR 2023
MonoDETR: Depth-guided Transformer for Monocular 3D Object Detection ICCV 2023
PointCLIP V2: Prompting CLIP and GPT for Powerful 3D Open-world Learning ICCV 2023
SparseMAE: Sparse Training Meets Masked Autoencoders ICCV 2023
Joint-MAE: 2D-3D Joint Masked Autoencoders for 3D Point Cloud Pre-training IJCAI 2023
Nearest Neighbors Meet Deep Neural Networks for Point Cloud Analysis WACV 2023
Frozen CLIP Models Are Efficient Video Learners ECCV 2022
PointCLIP: Point Cloud Understanding by CLIP CVPR 2022
Point-M2AE: Multi-scale Masked Autoencoders for Hierarchical Point Cloud Pre-training NIPS 2022
Exploring Resolution and Degradation Clues As Self-Supervised Signal for Low Quality Object Detection ECCV 2022
Tip-Adapter: Training-Free Adaption of CLIP for Few-Shot Classification ECCV 2022
Dual-stream Network for Visual Recognition NIPS 2021