Xiyang Dai

38 papers · 2017–2026 · 8 conferences · across top CS/AI conferences

Achievements

🐝 Cross-Pollinator (12) 🏃 Academic Marathon (8) 🧭 Keyword Pioneer 🌍 Conference Polyglot (7) 🌈 Renaissance Researcher (6)

🌉 Interdisciplinary Bridge 🗺️ Taxonomy Completionist (71) 🤝 Dynamic Duo (29) 🏆 Grand Slam 🏆 Keyword Champion (2) 🔬 Deep Specialist (12) 🗃️ Keyword Collector (173) 💎 Century Club (37) 🔥 Unstoppable (7) ❓ The Questioner ⚡ Prolific Year (8)

Conferences

CVPR (15) NIPS (7) ICCV (5) ICLR (4) ECCV (3) EMNLP (2) AAAI (1) ICML (1)

Top co-authors

Lu Yuan (29) Dongdong Chen (19) Yinpeng Chen (18) Mengchen Liu (17) Jianwei Yang (12) Zicheng Liu (9) Jianfeng Gao (8) Pengchuan Zhang (7) Bin Xiao (7) Yu-Gang Jiang (7)

Keywords

object detection (10) vision transformer (6) convolutional neural network (5) attention mechanism (4) zero-shot learning (4) image classification (4) semantic segmentation (4) video understanding (3) efficient computing (3) contrastive learning (3) semantic alignment (2) self-attention mechanism (2) image captioning (2) transfer learning (2) multimodal learning (2) vision-language alignment (2) self-supervised learning (2) visual question answering (2) weakly supervised learning (2) model architecture (2)

Papers

LLM2CLIP: Powerful Language Model Unlocks Richer Cross-Modality Representation · AAAI 2026
ProLongVid: A Simple but Strong Baseline for Long-context Video Instruction Tuning · EMNLP 2025
Exploring Invariance in Images through One-way Wave Equations · ICML 2025
DeepStack: Deeply Stacking Visual Tokens is Surprisingly Simple and Effective for LMMs · NIPS 2024
Efficient Modulation for Vision Networks · ICLR 2024
Rewrite the Stars · CVPR 2024
Florence-2: Advancing a Unified Representation for a Variety of Vision Tasks · CVPR 2024
Detection Hub: Unifying Object Detection Datasets via Query Adaptation on Language Embedding · CVPR 2023
Learning from Rich Semantics and Coarse Locations for Long-tailed Object Detection · NIPS 2023
Look Before You Match: Instance Understanding Matters in Video Object Segmentation · CVPR 2023
Masked Video Distillation: Rethinking Masked Feature Modeling for Self-Supervised Video Representation Learning · CVPR 2023
Generalized Decoding for Pixel, Image, and Language · CVPR 2023
LACMA: Language-Aligning Contrastive Learning with Meta-Actions for Embodied Instruction Following · EMNLP 2023
Layer Grafted Pre-training: Bridging Contrastive Learning And Masked Image Modeling For Label-Efficient Representations · ICLR 2023
RegionCLIP: Region-Based Language-Image Pretraining · CVPR 2022
BEVT: BERT Pretraining of Video Transformers · CVPR 2022
Efficient Self-supervised Vision Transformers for Representation Learning · ICLR 2022
Visual Clues: Bridging Vision and Language Foundations for Image Paragraph Captioning · NIPS 2022
Should All Proposals Be Treated Equally in Object Detection? · ECCV 2022
Focal Modulation Networks · NIPS 2022
GLIPv2: Unifying Localization and Vision-Language Understanding · NIPS 2022
Mobile-Former: Bridging MobileNet and Transformer · CVPR 2022
Reduce Information Loss in Transformers for Pluralistic Image Inpainting · CVPR 2022
MicroNet: Improving Image Recognition With Extremely Low FLOPs · ICCV 2021
Focal Attention for Long-Range Interactions in Vision Transformers · NIPS 2021
Revisiting Dynamic Convolution via Matrix Decomposition · ICLR 2021
Dynamic Head: Unifying Object Detection Heads With Attentions · CVPR 2021
Stronger NAS with Weaker Predictors · NIPS 2021
Multi-Scale Vision Longformer: A New Vision Transformer for High-Resolution Image Encoding · ICCV 2021
Dynamic DETR: End-to-End Object Detection With Dynamic Attention · ICCV 2021
CvT: Introducing Convolutions to Vision Transformers · ICCV 2021
Dynamic ReLU · ECCV 2020
METAL: Minimum Effort Temporal Activity Localization in Untrimmed Videos · CVPR 2020
Dynamic Convolution: Attention Over Convolution Kernels · CVPR 2020
DA-NAS: Data Adapted Pruning for Efficient Neural Architecture Search · ECCV 2020
MAN: Moment Alignment Network for Natural Language Moment Retrieval via Iterative Graph Adjustment · CVPR 2019
FASON: First and Second Order Information Fusion Network for Texture Recognition · CVPR 2017
Temporal Context Network for Activity Localization in Videos · ICCV 2017