Xiaoyi Dong

46 papers · 2019–2026 · 8 top CS/AI conferences

Achievements

🧭 Keyword Pioneer 🗺️ Taxonomy Completionist (10) 🌈 Renaissance Researcher (5) 🌉 Interdisciplinary Bridge 🌍 Conference Polyglot (8)

🏃 Academic Marathon (6) 🏆 Grand Slam 🤝 Dynamic Duo (28) 👥 Mega-Team (24) 🔬 Deep Specialist (14) 🧬 Topic Evolution 🗃️ Keyword Collector (202) 💎 Century Club (45) ❓ The Questioner (3) ⚡ Prolific Year (10)

Conferences

CVPR (14) ICCV (10) NIPS (7) ACL (4) ECCV (4) ICML (3) AAAI (2) ICLR (2)

Top co-authors

Jiaqi Wang (29) Pan Zhang (27) Yuhang Zang (25) Dahua Lin (21) Weiming Zhang (15) Yuhang Cao (15) Nenghai Yu (15) Dongdong Chen (14) Haodong Duan (10) Conghui He (9)

Keywords

vision-language model (7) multimodal learning (6) adversarial attack (6) vision transformer (5) large language model (4) video understanding (4) adversarial perturbation (3) image classification (3) object detection (3) adversarial sample (3) large vision-language model (3) point cloud (3) multi-modal learning (3) multimodal large language model (3) transfer learning (2) diffusion model (2) instruction tuning (2) vision language model (2) few-shot learning (2) reinforcement learning (2)

Papers

MinerU2.5: A Decoupled Vision-Language Model for Efficient High-Resolution Document Parsing · ACL 2026
OVO-Bench: How Far is Your Video-LLMs from Real-World Online Video Understanding? · CVPR 2025
Dispider: Enabling Video LLMs with Active Real-Time Interaction via Disentangled Perception, Decision, and Reaction · CVPR 2025
ByTheWay: Boost Your Text-to-Video Generation Model to Higher Quality in a Training-free Way · CVPR 2025
Conical Visual Concentration for Efficient Large Vision-Language Models · CVPR 2025
Deciphering Cross-Modal Alignment in Large Vision-Language Models via Modality Integration Rate · ICCV 2025
Visual-RFT: Visual Reinforcement Fine-Tuning · ICCV 2025
MM-IFEngine: Towards Multimodal Instruction Following · ICCV 2025
SongComposer: A Large Language Model for Lyric and Melody Generation in Song Composition · ACL 2025
InternLM-XComposer2.5-Reward: A Simple Yet Effective Multi-Modal Reward Model · ACL 2025
Towards Storage-Efficient Visual Document Retrieval: An Empirical Study on Reducing Patch-Level Embeddings · ACL 2025
VideoRoPE: What Makes for Good Video Rotary Position Embedding? · ICML 2025
SongGen: A Single Stage Auto-regressive Transformer for Text-to-Song Generation · ICML 2025
Maximum Entropy Reinforcement Learning with Diffusion Policy · ICML 2025
MotionClone: Training-Free Motion Cloning for Controllable Video Generation · ICLR 2025
MIA-DPO: Multi-Image Augmented Direct Preference Optimization For Large Vision-Language Models · ICLR 2025
Light-A-Video: Training-free Video Relighting via Progressive Light Fusion · ICCV 2025
SAM2Long: Enhancing SAM 2 for Long Video Segmentation with a Training-Free Memory Tree · ICCV 2025
Bootstrap3D: Improving Multi-view Diffusion Model with Synthetic Data · ICCV 2025
X-Prompt: Generalizable Auto-Regressive Visual Learning with In-Context Prompting · ICCV 2025
ShareGPT4V: Improving Large Multi-Modal Models with Better Captions · ECCV 2024
MMDU: A Multi-Turn Multi-Image Dialog Understanding Benchmark and Instruction-Tuning Dataset for LVLMs · NIPS 2024
ShareGPT4Video: Improving Video Understanding and Generation with Better Captions · NIPS 2024
Are We on the Right Way for Evaluating Large Vision-Language Models? · NIPS 2024
InternLM-XComposer2-4KHD: A Pioneering Large Vision-Language Model Handling Resolutions from 336 Pixels to 4K HD · NIPS 2024
MMLONGBENCH-DOC: Benchmarking Long-context Document Understanding with Visualizations · NIPS 2024
Streaming Long Video Understanding with Large Language Models · NIPS 2024
VIGC: Visual Instruction Generation and Correction · AAAI 2024
OPERA: Alleviating Hallucination in Multi-Modal Large Language Models via Over-Trust Penalty and Retrospection-Allocation · CVPR 2024
Long-CLIP: Unlocking the Long-Text Capability of CLIP · ECCV 2024
Emotional Listener Portrait: Neural Listener Head Generation with Emotion · ICCV 2023
MaskCLIP: Masked Self-Distillation Advances Contrastive Language-Image Pretraining · CVPR 2023
Diversity-Aware Meta Visual Prompting · CVPR 2023
PeCo: Perceptual Codebook for BERT Pre-training of Vision Transformers · AAAI 2023
Improving Adversarial Robustness of Masked Autoencoders via Test-time Frequency-domain Prompting · ICCV 2023
Protecting Celebrities From DeepFake With Identity Consistency Transformer · CVPR 2022
Adaptive Face Forgery Detection in Cross Domain · ECCV 2022
Bootstrapped Masked Autoencoders for Vision BERT Pretraining · ECCV 2022
Shape-Invariant 3D Adversarial Point Clouds · CVPR 2022
CSWin Transformer: A General Vision Transformer Backbone With Cross-Shaped Windows · CVPR 2022
Mobile-Former: Bridging MobileNet and Transformer · CVPR 2022
GreedyFool: Distortion-Aware Sparse Adversarial Attack · NIPS 2020
Self-Robust 3D Point Recognition via Gather-Vector Guidance · CVPR 2020
Robust Superpixel-Guided Attentional Adversarial Attack · CVPR 2020
LG-GAN: Label Guided Adversarial Network for Flexible Targeted Attack of Point Cloud Based Deep Networks · CVPR 2020
Once a MAN: Towards Multi-Target Attack via Learning Multi-Target Adversarial Network Once · ICCV 2019