Xiaoyi Dong
46 papers · 2019–2026 · 8 conferences · across top CS/AI conferences
Achievements
Keyword Pioneer
Taxonomy Completionist (10)
Renaissance Researcher (5)
Interdisciplinary Bridge
Conference Polyglot (8)
Academic Marathon (6)
Grand Slam
Dynamic Duo (28)
Mega-Team (24)
Deep Specialist (14)
Topic Evolution
Keyword Collector (202)
Century Club (45)
The Questioner (3)
Prolific Year (10)
Conferences
CVPR (14)
ICCV (10)
NIPS (7)
ACL (4)
ECCV (4)
ICML (3)
AAAI (2)
ICLR (2)
Keywords
vision-language model (7)
multimodal learning (6)
adversarial attack (6)
vision transformer (5)
large language model (4)
video understanding (4)
adversarial perturbation (3)
image classification (3)
object detection (3)
adversarial sample (3)
large vision-language model (3)
point cloud (3)
multi-modal learning (3)
multimodal large language model (3)
transfer learning (2)
diffusion model (2)
instruction tuning (2)
vision language model (2)
few-shot learning (2)
reinforcement learning (2)
Papers
MinerU2.5: A Decoupled Vision-Language Model for Efficient High-Resolution Document Parsing
ACL 2026
OVO-Bench: How Far is Your Video-LLMs from Real-World Online Video Understanding?
CVPR 2025
Dispider: Enabling Video LLMs with Active Real-Time Interaction via Disentangled Perception, Decision, and Reaction
CVPR 2025
ByTheWay: Boost Your Text-to-Video Generation Model to Higher Quality in a Training-free Way
CVPR 2025
Conical Visual Concentration for Efficient Large Vision-Language Models
CVPR 2025
Deciphering Cross-Modal Alignment in Large Vision-Language Models via Modality Integration Rate
ICCV 2025
Visual-RFT: Visual Reinforcement Fine-Tuning
ICCV 2025
MM-IFEngine: Towards Multimodal Instruction Following
ICCV 2025
SongComposer: A Large Language Model for Lyric and Melody Generation in Song Composition
ACL 2025
InternLM-XComposer2.5-Reward: A Simple Yet Effective Multi-Modal Reward Model
ACL 2025
Towards Storage-Efficient Visual Document Retrieval: An Empirical Study on Reducing Patch-Level Embeddings
ACL 2025
VideoRoPE: What Makes for Good Video Rotary Position Embedding?
ICML 2025
SongGen: A Single Stage Auto-regressive Transformer for Text-to-Song Generation
ICML 2025
Maximum Entropy Reinforcement Learning with Diffusion Policy
ICML 2025
MotionClone: Training-Free Motion Cloning for Controllable Video Generation
ICLR 2025
MIA-DPO: Multi-Image Augmented Direct Preference Optimization For Large Vision-Language Models
ICLR 2025
Light-A-Video: Training-free Video Relighting via Progressive Light Fusion
ICCV 2025
SAM2Long: Enhancing SAM 2 for Long Video Segmentation with a Training-Free Memory Tree
ICCV 2025
Bootstrap3D: Improving Multi-view Diffusion Model with Synthetic Data
ICCV 2025
X-Prompt: Generalizable Auto-Regressive Visual Learning with In-Context Prompting
ICCV 2025
ShareGPT4V: Improving Large Multi-Modal Models with Better Captions
ECCV 2024
MMDU: A Multi-Turn Multi-Image Dialog Understanding Benchmark and Instruction-Tuning Dataset for LVLMs
NIPS 2024
ShareGPT4Video: Improving Video Understanding and Generation with Better Captions
NIPS 2024
Are We on the Right Way for Evaluating Large Vision-Language Models?
NIPS 2024
InternLM-XComposer2-4KHD: A Pioneering Large Vision-Language Model Handling Resolutions from 336 Pixels to 4K HD
NIPS 2024
MMLONGBENCH-DOC: Benchmarking Long-context Document Understanding with Visualizations
NIPS 2024
Streaming Long Video Understanding with Large Language Models
NIPS 2024
VIGC: Visual Instruction Generation and Correction
AAAI 2024
OPERA: Alleviating Hallucination in Multi-Modal Large Language Models via Over-Trust Penalty and Retrospection-Allocation
CVPR 2024
Long-CLIP: Unlocking the Long-Text Capability of CLIP
ECCV 2024
Emotional Listener Portrait: Neural Listener Head Generation with Emotion
ICCV 2023
MaskCLIP: Masked Self-Distillation Advances Contrastive Language-Image Pretraining
CVPR 2023
Diversity-Aware Meta Visual Prompting
CVPR 2023
PeCo: Perceptual Codebook for BERT Pre-training of Vision Transformers
AAAI 2023
Improving Adversarial Robustness of Masked Autoencoders via Test-time Frequency-domain Prompting
ICCV 2023
Protecting Celebrities From DeepFake With Identity Consistency Transformer
CVPR 2022
Adaptive Face Forgery Detection in Cross Domain
ECCV 2022
Bootstrapped Masked Autoencoders for Vision BERT Pretraining
ECCV 2022
Shape-Invariant 3D Adversarial Point Clouds
CVPR 2022
CSWin Transformer: A General Vision Transformer Backbone With Cross-Shaped Windows
CVPR 2022
Mobile-Former: Bridging MobileNet and Transformer
CVPR 2022
GreedyFool: Distortion-Aware Sparse Adversarial Attack
NIPS 2020
Self-Robust 3D Point Recognition via Gather-Vector Guidance
CVPR 2020
Robust Superpixel-Guided Attentional Adversarial Attack
CVPR 2020
LG-GAN: Label Guided Adversarial Network for Flexible Targeted Attack of Point Cloud Based Deep Networks
CVPR 2020
Once a MAN: Towards Multi-Target Attack via Learning Multi-Target Adversarial Network Once
ICCV 2019