Jifeng Dai

89 papers · 2013–2026 · 7 conferences · across top CS/AI conferences

Achievements

🗺️ Taxonomy Completionist (11) 🧭 Keyword Pioneer 🌉 Interdisciplinary Bridge 🌈 Renaissance Researcher (6) 🌍 Conference Polyglot (6)

🌉 Interdisciplinary Bridge 🌍 Conference Polyglot (6) 🗺️ Taxonomy Completionist (11) 🏠 Conference Loyalist (34) 🤝 Dynamic Duo (46) 🏆 Grand Slam 🔬 Deep Specialist (16) 🧬 Topic Evolution 👥 Mega-Team (38) 👑 Triple Crown ⚡ Prolific Year (14) 📈 Trend Setter 🚀 Conference Pioneer 🔥 Unstoppable (13) 💎 Century Club (88) 🗃️ Keyword Collector (299)

Conferences

CVPR (34) NIPS (15) ICCV (14) ICLR (12) ECCV (10) ICML (3) AAAI (1)

Top co-authors

Xizhou Zhu (46) Yu Qiao (40) Wenhai Wang (28) Lewei Lu (24) Hongsheng Li (22) Zhe Chen (15) Xiaogang Wang (13) Hao Li (12) Tong Lu (11) Gao Huang (11)

Keywords

object detection (14) semantic segmentation (10) vision-language model (9) convolutional neural network (9) multimodal large language model (6) multimodal learning (5) foundation model (5) multi-task learning (5) multi-modal learning (5) large language model (5) self-supervised learning (4) visual representation (4) instance segmentation (4) deformable convolution (4) representation learning (3) optical flow (3) weakly supervised learning (3) transfer learning (3) image generation (3) visual question answering (3)

Papers

Grounding Actions in Camera Space: Observation-Centric Vision-Language-Action Policy (AAAI 2026)
SynerGen-VL: Towards Synergistic Image Understanding and Generation with Vision Experts and Token Folding (CVPR 2025)
MI-DETR: An Object Detection Model with Multi-time Inquiries Mechanism (CVPR 2025)
HoVLE: Unleashing the Power of Monolithic Vision-Language Models with Holistic Vision-Language Embedding (CVPR 2025)
Dita: Scaling Diffusion Transformer for Generalist Vision-Language-Action Policy (ICCV 2025)
PUMA: Empowering Unified MLLM with Multi-granular Visual Generation (ICCV 2025)
LangBridge: Interpreting Image as a Combination of Language Embeddings (ICCV 2025)
V2PE: Improving Multimodal Long-Context Capability of Vision-Language Models with Variable Visual Position Encoding (ICCV 2025)
Docopilot: Improving Multimodal Models for Document-Level Understanding (CVPR 2025)
Mono-InternVL: Pushing the Boundaries of Monolithic Multimodal Large Language Models with Endogenous Visual Pre-training (CVPR 2025)
PVC: Progressive Visual Token Compression for Unified Image and Video Processing in Large Vision-Language Models (CVPR 2025)
MuLan: Adapting Multilingual Diffusion Models for Hundreds of Languages with Negligible Cost (ICML 2025)
CoMemo: LVLMs Need Image Context with Image Memory (ICML 2025)
MMIU: Multimodal Multi-image Understanding for Evaluating Large Vision-Language Models (ICLR 2025)
Vision-RWKV: Efficient and Scalable Visual Perception with RWKV-Like Architectures (ICLR 2025)
OmniCorpus: A Unified Multimodal Corpus of 10 Billion-Level Images Interleaved with Text (ICLR 2025)
Maintaining Structural Integrity in Parameter Spaces for Parameter Efficient Fine-tuning (ICLR 2025)
Vision Model Pre-training on Interleaved Image-Text Data via Latent Compression Learning (NIPS 2024)
InternLM-XComposer2-4KHD: A Pioneering Large Vision-Language Model Handling Resolutions from 336 Pixels to 4K HD (NIPS 2024)
Learning 1D Causal Visual Representation with De-focus Attention Networks (NIPS 2024)
DI-MaskDINO: A Joint Object Detection and Instance Segmentation Model (NIPS 2024)
VisionLLM v2: An End-to-End Generalist Multimodal Large Language Model for Hundreds of Vision-Language Tasks (NIPS 2024)
InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks (CVPR 2024)
Efficient Deformable ConvNets: Rethinking Dynamic and Sparse Operator for Vision Applications (CVPR 2024)
Auto MC-Reward: Automated Dense Reward Design with Large Language Models for Minecraft (CVPR 2024)
Point2RBox: Combine Knowledge from Synthetic Visual Patterns for End-to-end Oriented Object Detection with Single Point Supervision (CVPR 2024)
CooHOI: Learning Cooperative Human-Object Interaction with Manipulated Object Dynamics (NIPS 2024)
Parameter-Inverted Image Pyramid Networks (NIPS 2024)
RoboCodeX: Multimodal Code Generation for Robotic Behavior Synthesis (ICML 2024)
ADDP: Learning General Representations for Image Recognition and Generation with Alternating Denoising Diffusion Process (ICLR 2024)
The All-Seeing Project: Towards Panoptic Visual Recognition and Understanding of the Open World (ICLR 2024)
Bounding Box Stability against Feature Dropout Reflects Detector Generalization across Environments (ICLR 2024)
Distilling Knowledge from Large-Scale Image Models for Object Detection (ECCV 2024)
The All-Seeing Project V2: Towards General Relation Comprehension of the Open World (ECCV 2024)
ControlLLM: Augment Language Models with Tools by Searching on Graphs (ECCV 2024)
Needle In A Multimodal Haystack (NIPS 2024)
Vision Transformer Adapter for Dense Predictions (ICLR 2023)
EmbodiedGPT: Vision-Language Pre-Training via Embodied Chain of Thought (NIPS 2023)
JourneyDB: A Benchmark for Generative Image Understanding (NIPS 2023)
VisionLLM: Large Language Model is also an Open-Ended Decoder for Vision-Centric Tasks (NIPS 2023)
Towards All-in-One Pre-Training via Maximizing Multi-Modal Mutual Information (CVPR 2023)
Uni-Perceiver v2: A Generalist Model for Large-Scale Vision and Vision-Language Tasks (CVPR 2023)
BEVFormer v2: Adapting Modern Image Backbones to Bird's-Eye-View Recognition via Perspective Supervision (CVPR 2023)
FlowFormer++: Masked Cost Volume Autoencoding for Pretraining Optical Flow Estimation (CVPR 2023)
Siamese Image Modeling for Self-Supervised Vision Representation Learning (CVPR 2023)
Video Dehazing via a Multi-Range Temporal Alignment Network With Physical Prior (CVPR 2023)
InternImage: Exploring Large-Scale Vision Foundation Models With Deformable Convolutions (CVPR 2023)
Planning-Oriented Autonomous Driving (CVPR 2023)
Learning Weather-General and Weather-Specific Features for Image Restoration Under Multiple Adverse Weather Conditions (CVPR 2023)
VideoFlow: Exploiting Temporal Cues for Multi-frame Optical Flow Estimation (ICCV 2023)
MCMAE: Masked Convolution Meets Masked Autoencoders (NIPS 2022)
BEVFormer: Learning Bird’s-Eye-View Representation from Multi-Camera Images via Spatiotemporal Transformers (ECCV 2022)
FlowFormer: A Transformer Architecture for Optical Flow (ECCV 2022)
VL-LTR: Learning Class-Wise Visual-Linguistic Representation for Long-Tailed Visual Recognition (ECCV 2022)
Frozen CLIP Models Are Efficient Video Learners (ECCV 2022)
Tip-Adapter: Training-Free Adaption of CLIP for Few-Shot Classification (ECCV 2022)
Exploring the Equivalence of Siamese Self-Supervised Learning via a Unified Gradient Framework (CVPR 2022)
AutoLoss-Zero: Searching Loss Functions From Scratch for Generic Tasks (CVPR 2022)
Uni-Perceiver: Pre-Training Unified Architecture for Generic Perception for Zero-Shot and Few-Shot Tasks (CVPR 2022)
Uni-Perceiver-MoE: Learning Sparse Generalist Models with Conditional MoEs (NIPS 2022)
Fast Convergence of DETR With Spatially Modulated Co-Attention (ICCV 2021)
Deformable DETR: Deformable Transformers for End-to-End Object Detection (ICLR 2021)
Auto Seg-Loss: Searching Metric Surrogates for Semantic Segmentation (ICLR 2021)
Unsupervised Object Detection With LIDAR Clues (CVPR 2021)
Influence Selection for Active Learning (ICCV 2021)
FuseFormer: Fusing Fine-Grained Information in Transformers for Video Inpainting (ICCV 2021)
Exploring Cross-Image Pixel Contrast for Semantic Segmentation (ICCV 2021)
Searching Parameterized AP Loss for Object Detection (NIPS 2021)
Mining Cross-Image Semantics for Weakly Supervised Semantic Segmentation (ECCV 2020)
Hierarchical Human Parsing With Typed Part-Relation Reasoning (CVPR 2020)
Resolution Adaptive Networks for Efficient Inference (CVPR 2020)
Deformable Kernels: Adapting Effective Receptive Fields for Object Deformation (ICLR 2020)
VL-BERT: Pre-training of Generic Visual-Linguistic Representations (ICLR 2020)
An Empirical Study of Spatial Attention Mechanisms in Deep Networks (ICCV 2019)
Deformable ConvNets V2: More Deformable, Better Results (CVPR 2019)
Towards High Performance Video Object Detection (CVPR 2018)
Relation Networks for Object Detection (CVPR 2018)
Learning Region Features for Object Detection (ECCV 2018)
Deep Feature Flow for Video Recognition (CVPR 2017)
Deformable Convolutional Networks (ICCV 2017)
Flow-Guided Feature Aggregation for Video Object Detection (ICCV 2017)
Fully Convolutional Instance-Aware Semantic Segmentation (CVPR 2017)
R-FCN: Object Detection via Region-based Fully Convolutional Networks (NIPS 2016)
Instance-Aware Semantic Segmentation via Multi-Task Network Cascades (CVPR 2016)
ScribbleSup: Scribble-Supervised Convolutional Networks for Semantic Segmentation (CVPR 2016)
Convolutional Feature Masking for Joint Object and Stuff Segmentation (CVPR 2015)
BoxSup: Exploiting Bounding Boxes to Supervise Convolutional Networks for Semantic Segmentation (ICCV 2015)
Unsupervised Learning of Dictionaries of Hierarchical Compositional Models (CVPR 2014)
Cosegmentation and Cosketch by Unsupervised Learning (ICCV 2013)