Jianwei Yang

52 papers · 2016–2025 · 10 conferences · across top CS/AI conferences

Achievements


🌍 Conference Polyglot (10) 🧭 Keyword Pioneer 🌉 Interdisciplinary Bridge 🌈 Renaissance Researcher (5) 🏃 Academic Marathon (9) 🌟 Keyword Trendsetter Combo (4) 🤝 Dynamic Duo (29) 🔬 Deep Specialist (20) 🧬 Topic Evolution 🏆 Keyword Champion (5) 🏆 Grand Slam ⚡ Prolific Year (7) ❓ The Questioner 🗃️ Keyword Collector (188) 💎 Century Club (52) 📈 Trend Setter 🔥 Unstoppable (5) 🚀 Conference Pioneer

Conferences

CVPR (14) NIPS (13) ICCV (7) ECCV (6) ICLR (5) EMNLP (2) ICML (2) AAAI (1) CORL (1) MICCAI (1)

Top co-authors

Jianfeng Gao (29) Chunyuan Li (19) Lu Yuan (12) Xiyang Dai (12) Pengchuan Zhang (11) Lei Zhang (10) Feng Li (8) Xueyan Zou (8) Devi Parikh (8) Hao Zhang (8)

Keywords

object detection (13) vision-language model (10) transfer learning (9) multimodal learning (8) convolutional neural network (5) semantic segmentation (5) few-shot learning (5) image segmentation (5) open-vocabulary segmentation (5) zero-shot learning (5) contrastive learning (5) vision transformer (5) image classification (4) representation learning (3) multi-modal learning (3) visual question answering (3) multimodal large language model (3) visual representation (2) visual grounding (2) knowledge distillation (2)

Papers

Simplifying DINO via Coding Rate Regularization · ICML 2025
ReFocus: Visual Editing as a Chain of Thought for Structured Image Understanding · ICML 2025
Florence-VL: Enhancing Vision-Language Models with Generative Vision Encoder and Depth-Breadth Fusion · CVPR 2025
Is Your World Simulator a Good Story Presenter? A Consecutive Events-Based Benchmark for Future Long Video Generation · CVPR 2025
TraceVLA: Visual Trace Prompting Enhances Spatial-Temporal Awareness for Generalist Robotic Policies · ICLR 2025
Latent Action Pretraining from Videos · ICLR 2025
Matryoshka Multimodal Models · ICLR 2025
SITE: towards Spatial Intelligence Thorough Evaluation · ICCV 2025
ProLongVid: A Simple but Strong Baseline for Long-context Video Instruction Tuning · EMNLP 2025
Magma: A Foundation Model for Multimodal AI Agents · CVPR 2025
Structure-Aware Cross-Modal Prompt Tuning for Autonomous Bronchoscopic Navigation · MICCAI 2025
Efficient Modulation for Vision Networks · ICLR 2024
DeepStack: Deeply Stacking Visual Tokens is Surprisingly Simple and Effective for LMMs · NIPS 2024
Interfacing Foundation Models' Embeddings · NIPS 2024
Towards Flexible Visual Relationship Segmentation · NIPS 2024
VCoder: Versatile Vision Encoders for Multimodal Large Language Models · CVPR 2024
Visual In-Context Prompting · CVPR 2024
LLaVA-Grounding: Grounded Visual Chat with Large Multimodal Models · ECCV 2024
Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection · ECCV 2024
LLaVA-Plus: Learning to Use Tools for Creating Multimodal Agents · ECCV 2024
Segment and Recognize Anything at Any Granularity · ECCV 2024
Pix2Gif: Motion-Guided Diffusion for GIF Generation · ECCV 2024
GLIGEN: Open-Set Grounded Text-to-Image Generation · CVPR 2023
Generalized Decoding for Pixel, Image, and Language · CVPR 2023
Parameter-Efficient Model Adaptation for Vision Transformers · AAAI 2023
Learning from Rich Semantics and Coarse Locations for Long-tailed Object Detection · NIPS 2023
LACMA: Language-Aligning Contrastive Learning with Meta-Actions for Embodied Instruction Following · EMNLP 2023
LLaVA-Med: Training a Large Language-and-Vision Assistant for Biomedicine in One Day · NIPS 2023
A Simple Framework for Open-Vocabulary Segmentation and Detection · ICCV 2023
Segment Everything Everywhere All at Once · NIPS 2023
Learning Customized Visual Models With Retrieval-Augmented Knowledge · CVPR 2023
Focal Modulation Networks · NIPS 2022
Grounded Language-Image Pre-Training · CVPR 2022
RegionCLIP: Region-Based Language-Image Pretraining · CVPR 2022
Unified Contrastive Learning in Image-Text-Label Space · CVPR 2022
Efficient Self-supervised Vision Transformers for Representation Learning · ICLR 2022
K-LITE: Learning Transferable Visual Models with External Knowledge · NIPS 2022
ELEVATER: A Benchmark and Toolkit for Evaluating Language-Augmented Visual Models · NIPS 2022
Focal Attention for Long-Range Interactions in Vision Transformers · NIPS 2021
TACo: Token-Aware Cascade Contrastive Learning for Video-Text Alignment · ICCV 2021
Dynamic DETR: End-to-End Object Detection With Dynamic Attention · ICCV 2021
Learning To Generate Scene Graph From Natural Language Supervision · ICCV 2021
Multi-Scale Vision Longformer: A New Vision Transformer for High-Resolution Image Encoding · ICCV 2021
VinVL: Revisiting Visual Representations in Vision-Language Models · CVPR 2021
Embodied Amodal Recognition: Learning to Move to Perceive Objects · ICCV 2019
Cross-channel Communication Networks · NIPS 2019
Neural Baby Talk · CVPR 2018
Visual Curiosity: Learning to Ask Questions to Learn Visual Recognition · CORL 2018
Graph R-CNN for Scene Graph Generation · ECCV 2018
Best of Both Worlds: Transferring Knowledge from Discriminative Learning to a Generative Visual Dialog Model · NIPS 2017
Joint Unsupervised Learning of Deep Representations and Image Clusters · CVPR 2016
Hierarchical Question-Image Co-Attention for Visual Question Answering · NIPS 2016