Mike Zheng Shou

100 papers · 2021–2026 · 10 top CS/AI conferences

Achievements


🧭 Keyword Pioneer 🗺️ Taxonomy Completionist (16) 🌈 Renaissance Researcher (5) 🌉 Interdisciplinary Bridge 🌍 Conference Polyglot (10)

🐝 Cross-Pollinator (7) 🏠 Conference Loyalist (32) 🏆 Grand Slam 🤝 Dynamic Duo (20) 👥 Mega-Team (100) 🔬 Deep Specialist (32) 🏆 Keyword Champion (2) ⚡ Prolific Year (33) 🗃️ Keyword Collector (414) 🔥 Unstoppable (5) ❓ The Questioner 💎 Century Club (99)

Conferences

CVPR (32) ICCV (19) NIPS (18) ECCV (10) ICLR (7) AAAI (5) EMNLP (3) ACL (2) ICML (2) IJCAI (2)

Top co-authors

Difei Gao (20) Kevin Qinghong Lin (18) Jia-Wei Liu (13) David Junhao Zhang (11) Yuchao Gu (11) Joya Chen (10) Ying Shan (10) Jay Zhangjie Wu (9) Rui Zhao (9) Zechen Bai (9)

Research topics

Privacy (1)

Keywords

diffusion model (14) multimodal learning (14) video understanding (10) image generation (7) vision transformer (7) large language model (6) transfer learning (6) video generation (6) contrastive learning (5) multi-modal learning (5) generative model (4) vision-language model (4) action recognition (4) neural radiance field (4) object detection (3) benchmark evaluation (3) text-to-image generation (3) domain adaptation (3) video retrieval (3) knowledge distillation (3)

Papers

OptMark: Robust Multi-bit Diffusion Watermarking via Inference Time Optimization AAAI 2026
InterFeedback: Unveiling Interactive Intelligence of Large Multimodal Models with Human Feedback EMNLP 2025
SAM-I2V: Upgrading SAM to Support Promptable Video Segmentation with Less than 0.2% Training Cost CVPR 2025
ShowUI: One Vision-Language-Action Model for GUI Visual Agent CVPR 2025
ROICtrl: Boosting Instance Control for Visual Generation CVPR 2025
MovieBench: A Hierarchical Movie Level Dataset for Long Video Generation CVPR 2025
IDProtector: An Adversarial Noise Encoder to Protect Against ID-Preserving Image Generation CVPR 2025
DIFIX3D+: Improving 3D Reconstructions with Single-Step Diffusion Models CVPR 2025
VLog: Video-Language Models by Generative Retrieval of Narration Vocabulary CVPR 2025
MP-Mat: A 3D-and-Instance-Aware Human Matting and Editing Framework with Multiplane Representation ICLR 2025
WMAdapter: Adding WaterMark Control to Latent Diffusion Models ICML 2025
Impossible Videos ICML 2025
Bridging Information Asymmetry in Text-video Retrieval: A Data-centric Approach ICLR 2025
Grounding Multimodal Large Language Model in GUI World ICLR 2025
Show-o: One Single Transformer to Unify Multimodal Understanding and Generation ICLR 2025
Image Watermarks are Removable using Controllable Regeneration from Clean Noise ICLR 2025
Factorized Learning for Temporally Grounded Video-Language Models ICCV 2025
VG-TVP: Multimodal Procedural Planning via Visually Grounded Text-Video Prompting AAAI 2025
Balanced Image Stylization with Style Matching Score ICCV 2025
PhysReason: A Comprehensive Benchmark towards Physics-Based Reasoning ACL 2025
ReCapture: Generative Video Camera Controls for User-Provided Videos using Masked Video Fine-Tuning CVPR 2025
DoraCycle: Domain-Oriented Adaptation of Unified Generative Model in Multimodal Cycles CVPR 2025
LiveCC: Learning Video LLM with Streaming Speech Transcription at Scale CVPR 2025
DiffSim: Taming Diffusion Models for Evaluating Visual Similarity ICCV 2025
LayerTracer: Cognitive-Aligned Layered SVG Synthesis via Diffusion Transformer ICCV 2025
Free-ATM: Harnessing Free Attention Masks for Representation Learning on Diffusion-Generated Images ECCV 2024
One Token to Seg Them All: Language Instructed Reasoning Segmentation in Videos NIPS 2024
Leveraging Visual Tokens for Extended Text Contexts in Multi-Modal Learning NIPS 2024
Visual Perception by Large Language Model’s Weights NIPS 2024
Can Simple Averaging Defeat Modern Watermarks? NIPS 2024
VideoGUI: A Benchmark for GUI Automation from Instructional Videos NIPS 2024
DoFIT: Domain-aware Federated Instruction Tuning with Alleviated Catastrophic Forgetting NIPS 2024
VideoLLM-MoD: Efficient Video-Language Streaming with Mixture-of-Depths Vision Computation NIPS 2024
LOVA3: Learning to Visual Question Answering, Asking and Assessment NIPS 2024
EvolveDirector: Approaching Advanced Text-to-Image Generation with Large Vision-Language Models NIPS 2024
Skinned Motion Retargeting with Dense Geometric Interaction Perception NIPS 2024
Exocentric-to-Egocentric Video Generation NIPS 2024
VideoLLM-online: Online Video Large Language Model for Streaming Video CVPR 2024
L4D-Track: Language-to-4D Modeling Towards 6-DoF Tracking and Shape Reconstruction in 3D Point Cloud Stream CVPR 2024
Rethinking the Objectives of Vector-Quantized Tokenizers for Image Synthesis CVPR 2024
Bootstrapping SparseFormers from Vision Foundation Models CVPR 2024
X-Adapter: Adding Universal Compatibility of Plugins for Upgraded Diffusion Model CVPR 2024
VideoSwap: Customized Video Subject Swapping with Interactive Semantic Point Correspondence CVPR 2024
Tune-An-Ellipse: CLIP Has Potential to Find What You Want CVPR 2024
Ego-Exo4D: Understanding Skilled Human Activity from First- and Third-Person Perspectives CVPR 2024
MagicAnimate: Temporally Consistent Human Image Animation using Diffusion Model CVPR 2024
ViT-Lens: Towards Omni-modal Representations CVPR 2024
AssistGUI: Task-Oriented PC Graphical User Interface Automation CVPR 2024
DynVideo-E: Harnessing Dynamic NeRF for Large-Scale Motion- and View-Change Human-Centric Video Editing CVPR 2024
Drag Anything: Motion Control for Anything using Entity Representation ECCV 2024
GENIXER: Empowering Multimodal Large Language Models as a Powerful Data Generator ECCV 2024
Rethinking Tree-Ring Watermarking for Enhanced Multi-Key Identification ECCV 2024
Parrot Captions Teach CLIP to Spot Text ECCV 2024
Learning Video Context as Interleaved Multimodal Sequences ECCV 2024
MotionDirector: Motion Customization of Text-to-Video Diffusion Models ECCV 2024
SparseFormer: Sparse Visual Recognition via Limited Latent Tokens ICLR 2024
Apprenticeship-Inspired Elegance: Synergistic Knowledge Distillation Empowers Spiking Neural Networks for Efficient Single-Eye Emotion Recognition IJCAI 2024
Delocate: Detection and Localization for Deepfake Videos with Randomly-Located Tampered Traces IJCAI 2024
CONE: An Efficient COarse-to-fiNE Alignment Framework for Long Video Temporal Grounding ACL 2023
Darwinian Model Upgrades: Model Evolving with Selective Compatibility AAAI 2023
PV3D: A 3D Generative Model for Portrait Video Generation ICLR 2023
XAGen: 3D Expressive Human Avatars Generation NIPS 2023
Video-Text Pre-training with Learned Regions for Retrieval AAAI 2023
Towards Fast Adaptation of Pretrained Contrastive Models for Multi-Channel Video-Language Retrieval CVPR 2023
Mix-of-Show: Decentralized Low-Rank Adaptation for Multi-Concept Customization of Diffusion Models NIPS 2023
Object-centric Learning with Cyclic Walks between Parts and Whole NIPS 2023
Position-Guided Text Prompt for Vision-Language Pre-Training CVPR 2023
Learning Visual Prior via Generative Pre-Training NIPS 2023
DatasetDM: Synthesizing Data with Perception Annotations Using Diffusion Models NIPS 2023
All in One: Exploring Unified Video-Language Pre-Training CVPR 2023
Making Vision Transformers Efficient From a Token Sparsification View CVPR 2023
Affordance Grounding From Demonstration Video To Target Image CVPR 2023
DiffuMask: Synthesizing Images with Pixel-level Annotations for Semantic Segmentation Using Diffusion Models ICCV 2023
Too Large; Data Reduction for Vision-Language Pre-Training ICCV 2023
STPrivacy: Spatio-Temporal Privacy-Preserving Action Recognition ICCV 2023
Unsupervised Open-Vocabulary Object Localization in Videos ICCV 2023
Tune-A-Video: One-Shot Tuning of Image Diffusion Models for Text-to-Video Generation ICCV 2023
Revisiting Vision Transformer from the View of Path Ensemble ICCV 2023
Learning to Learn: How to Continuously Teach Humans and Machines ICCV 2023
UniVTG: Towards Unified Video-Language Temporal Grounding ICCV 2023
EgoVLPv2: Egocentric Video-Language Pre-training with Fusion in the Backbone ICCV 2023
BoxDiff: Text-to-Image Synthesis with Training-Free Box-Constrained Diffusion ICCV 2023
HOSNeRF: Dynamic Human-Object-Scene Neural Radiance Fields from a Single Video ICCV 2023
Label-Efficient Online Continual Object Detection in Streaming Video ICCV 2023
Symbolic Replay: Scene Graph as Prompt for Continual Learning on VQA Task AAAI 2023
MIST: Multi-Modal Iterative Spatial-Temporal Transformer for Long-Form Video Question Answering CVPR 2023
GEB+: A Benchmark for Generic Event Boundary Captioning, Grounding and Retrieval ECCV 2022
Egocentric Video-Language Pretraining NIPS 2022
AssistSR: Task-oriented Video Segment Retrieval for Personal AI Assistant EMNLP 2022
Unified Transformer Tracker for Object Tracking CVPR 2022
Object-Aware Video-Language Pre-Training for Retrieval CVPR 2022
DeVRF: Fast Deformable Voxel Radiance Fields for Dynamic Scenes NIPS 2022
Ego4D: Around the World in 3,000 Hours of Egocentric Video CVPR 2022
MorphMLP: An Efficient MLP-Like Backbone for Spatial-Temporal Representation Learning ECCV 2022
AssistQ: Affordance-Centric Question-Driven Task Completion for Egocentric Assistant ECCV 2022
Channel Augmented Joint Learning for Visible-Infrared Recognition ICCV 2021
On Pursuit of Designing Multi-modal Transformer for Video Grounding EMNLP 2021
Actor-Context-Actor Relation Network for Spatio-Temporal Action Localization CVPR 2021
Generic Event Boundary Detection: A Benchmark for Event Segmentation ICCV 2021
Searching for Two-Stream Models in Multivariate Space for Video Recognition ICCV 2021