Weidi Xie

57 papers · 2018–2026 · 10 conferences · across top CS/AI conferences

Achievements

🌍 Conference Polyglot (10) 🏃 Academic Marathon (7) 🌉 Interdisciplinary Bridge 🧭 Keyword Pioneer 🐝 Cross-Pollinator (5)

🌈 Renaissance Researcher (6) 🐣 Hot Topic Early Bird 🤝 Dynamic Duo (21) 🏆 Grand Slam 🔬 Deep Specialist (16) 🧬 Topic Evolution 🗃️ Keyword Collector (207) 📈 Trend Setter ⚡ Prolific Year (12) 🚀 Conference Pioneer 🔥 Unstoppable (6) 💎 Century Club (56)

Conferences

CVPR (18) ICCV (11) ECCV (10) NIPS (6) ICLR (4) EMNLP (3) AAAI (2) ICML (1) MICCAI (1) WACV (1)

Top co-authors

Andrew Zisserman (21) Ya Zhang (12) Yanfeng Wang (12) Tengda Han (9) Yao Hu (6) Xiaolong Jiang (6) Jilan Xu (5) Qirui Chen (5) Arsha Nagrani (5) Yifei Huang (4)

Keywords

video understanding (11) semantic segmentation (6) self-supervised learning (6) vision-language model (6) multimodal learning (5) zero-shot learning (5) video question answering (4) synthetic data (4) video segmentation (3) image segmentation (3) depth estimation (3) egocentric video (3) instruction tuning (3) multi-modal learning (3) text generation (3) synthetic data generation (3) audio description (3) diffusion model (3) contrastive learning (2) action recognition (2)

Papers

Versatile Vision-Language Model for 3D Computed Tomography AAAI 2026
Enhancing Video-LLM Reasoning via Agent-of-Thoughts Distillation CVPR 2025
RadIR: A Scalable Framework for Multi-Grained Medical Image Retrieval via Radiology Report Mining MICCAI 2025
A Sanity Check for AI-generated Image Detection ICLR 2025
EgoExo-Gen: Ego-centric Video Prediction by Watching Exo-centric Videos ICLR 2025
Grounded Multi-Hop VideoQA in Long-Form Egocentric Videos AAAI 2025
Track-On: Transformer-based Online Point Tracking with Memory ICLR 2025
Modeling Fine-Grained Hand-Object Dynamics for Egocentric Video Representation Learning ICLR 2025
Shot-by-Shot: Film-Grammar-Aware Training-Free Audio Description Generation ICCV 2025
Object-centric Video Question Answering with Visual Grounding and Referring ICCV 2025
Learning Streaming Video Representation via Multitask Training ICCV 2025
MRGen: Segmentation Data Engine For Underrepresented MRI Modalities ICCV 2025
LamRA: Large Multimodal Model as Your Advanced Retrieval Assistant CVPR 2025
Towards Universal Soccer Video Understanding CVPR 2025
Intelligent Grimm - Open-ended Visual Storytelling via Latent Diffusion Models CVPR 2024
A General Protocol to Probe Large Vision Models for 3D Physical Understanding NIPS 2024
Grounded Question-Answering in Long Egocentric Videos CVPR 2024
Retrieval-Augmented Egocentric Video Captioning CVPR 2024
InstaGen: Enhancing Object Detection by Training on Synthetic Dataset CVPR 2024
Amodal Ground Truth and Completion in the Wild CVPR 2024
AutoAD III: The Prequel - Back to the Pixels CVPR 2024
VISA: Reasoning Video Object Segmentation via Large Language Model ECCV 2024
Appearance-based Refinement for Object-Centric Motion Segmentation ECCV 2024
Knowledge-enhanced Visual-Language Pretraining for Computational Pathology ECCV 2024
Multi-Sentence Grounding for Long-term Instructional Video ECCV 2024
Made to Order: Discovering monotonic temporal changes via self-supervised video ordering ECCV 2024
MatchTime: Towards Automatic Soccer Game Commentary Generation EMNLP 2024
RaTEScore: A Metric for Radiology Report Generation EMNLP 2024
EchoSight: Advancing Visual-Language Models with Wiki Knowledge EMNLP 2024
Annotation-Free Audio-Visual Segmentation WACV 2024
The Making and Breaking of Camouflage ICCV 2023
Towards Open-Vocabulary Video Instance Segmentation ICCV 2023
Open-vocabulary Object Segmentation with Diffusion Models ICCV 2023
Learning Open-Vocabulary Semantic Segmentation Models From Natural Language Supervision CVPR 2023
Collaboration Helps Camera Overtake LiDAR in 3D Detection CVPR 2023
Multi-Modal Classifiers for Open-Vocabulary Object Detection ICML 2023
OvarNet: Towards Open-Vocabulary Object Attribute Recognition CVPR 2023
Self-supervised Object-Centric Learning for Videos NIPS 2023
AutoAD: Movie Description in Context CVPR 2023
MedKLIP: Medical Knowledge Enhanced Language-Image Pre-Training for X-ray Diagnosis ICCV 2023
AutoAD II: The Sequel - Who, When, and What in Movie Audio Description ICCV 2023
Joint-Relation Transformer for Multi-Person Motion Prediction ICCV 2023
Prompting Visual-Language Models for Efficient Video Understanding ECCV 2022
ReCo: Retrieve and Co-segment for Zero-shot Transfer NIPS 2022
Segmenting Moving Objects via an Object-Centric Layered Representation NIPS 2022
Associating Objects and Their Effects in Video through Coordination Games NIPS 2022
Label, Verify, Correct: A Simple Few Shot Object Detection Method CVPR 2022
It's About Time: Analog Clock Reading in the Wild CVPR 2022
PromptDet: Towards Open-Vocabulary Detection Using Uncurated Images ECCV 2022
Temporal Alignment Networks for Long-Term Video CVPR 2022
Self-Supervised Video Object Segmentation by Motion Grouping ICCV 2021
Localizing Visual Sounds the Hard Way CVPR 2021
Smooth-AP: Smoothing the Path Towards Large-Scale Image Retrieval ECCV 2020
Self-supervised Co-Training for Video Representation Learning NIPS 2020
Memory-augmented Dense Predictive Coding for Video Representation Learning ECCV 2020
MAST: A Memory-Augmented Self-Supervised Tracker CVPR 2020
Comparator Networks ECCV 2018