Mike Zheng Shou
100 papers · 2021–2026 · 10 top CS/AI conferences
Achievements
Keyword Pioneer
Taxonomy Completionist (16)
Renaissance Researcher (5)
Interdisciplinary Bridge
Conference Polyglot (10)
Cross-Pollinator (7)
Conference Loyalist (32)
Grand Slam
Dynamic Duo (20)
Mega-Team (100)
Deep Specialist (32)
Keyword Champion (2)
Prolific Year (33)
Keyword Collector (414)
Unstoppable (5)
The Questioner
Century Club (99)
Conferences
CVPR (32)
ICCV (19)
NIPS (18)
ECCV (10)
ICLR (7)
AAAI (5)
EMNLP (3)
ACL (2)
ICML (2)
IJCAI (2)
Research topics
Keywords
diffusion model (14)
multimodal learning (14)
video understanding (10)
image generation (7)
vision transformer (7)
large language model (6)
transfer learning (6)
video generation (6)
contrastive learning (5)
multi-modal learning (5)
generative model (4)
vision-language model (4)
action recognition (4)
neural radiance field (4)
object detection (3)
benchmark evaluation (3)
text-to-image generation (3)
domain adaptation (3)
video retrieval (3)
knowledge distillation (3)
Papers
OptMark: Robust Multi-bit Diffusion Watermarking via Inference Time Optimization
AAAI 2026
InterFeedback: Unveiling Interactive Intelligence of Large Multimodal Models with Human Feedback
EMNLP 2025
SAM-I2V: Upgrading SAM to Support Promptable Video Segmentation with Less than 0.2% Training Cost
CVPR 2025
ShowUI: One Vision-Language-Action Model for GUI Visual Agent
CVPR 2025
ROICtrl: Boosting Instance Control for Visual Generation
CVPR 2025
MovieBench: A Hierarchical Movie Level Dataset for Long Video Generation
CVPR 2025
IDProtector: An Adversarial Noise Encoder to Protect Against ID-Preserving Image Generation
CVPR 2025
DIFIX3D+: Improving 3D Reconstructions with Single-Step Diffusion Models
CVPR 2025
VLog: Video-Language Models by Generative Retrieval of Narration Vocabulary
CVPR 2025
MP-Mat: A 3D-and-Instance-Aware Human Matting and Editing Framework with Multiplane Representation
ICLR 2025
WMAdapter: Adding WaterMark Control to Latent Diffusion Models
ICML 2025
Impossible Videos
ICML 2025
Bridging Information Asymmetry in Text-video Retrieval: A Data-centric Approach
ICLR 2025
Grounding Multimodal Large Language Model in GUI World
ICLR 2025
Show-o: One Single Transformer to Unify Multimodal Understanding and Generation
ICLR 2025
Image Watermarks are Removable using Controllable Regeneration from Clean Noise
ICLR 2025
Factorized Learning for Temporally Grounded Video-Language Models
ICCV 2025
VG-TVP: Multimodal Procedural Planning via Visually Grounded Text-Video Prompting
AAAI 2025
Balanced Image Stylization with Style Matching Score
ICCV 2025
PhysReason: A Comprehensive Benchmark towards Physics-Based Reasoning
ACL 2025
ReCapture: Generative Video Camera Controls for User-Provided Videos using Masked Video Fine-Tuning
CVPR 2025
DoraCycle: Domain-Oriented Adaptation of Unified Generative Model in Multimodal Cycles
CVPR 2025
LiveCC: Learning Video LLM with Streaming Speech Transcription at Scale
CVPR 2025
DiffSim: Taming Diffusion Models for Evaluating Visual Similarity
ICCV 2025
LayerTracer: Cognitive-Aligned Layered SVG Synthesis via Diffusion Transformer
ICCV 2025
Free-ATM: Harnessing Free Attention Masks for Representation Learning on Diffusion-Generated Images
ECCV 2024
One Token to Seg Them All: Language Instructed Reasoning Segmentation in Videos
NIPS 2024
Leveraging Visual Tokens for Extended Text Contexts in Multi-Modal Learning
NIPS 2024
Visual Perception by Large Language Model's Weights
NIPS 2024
Can Simple Averaging Defeat Modern Watermarks?
NIPS 2024
VideoGUI: A Benchmark for GUI Automation from Instructional Videos
NIPS 2024
DoFIT: Domain-aware Federated Instruction Tuning with Alleviated Catastrophic Forgetting
NIPS 2024
VideoLLM-MoD: Efficient Video-Language Streaming with Mixture-of-Depths Vision Computation
NIPS 2024
LOVA3: Learning to Visual Question Answering, Asking and Assessment
NIPS 2024
EvolveDirector: Approaching Advanced Text-to-Image Generation with Large Vision-Language Models
NIPS 2024
Skinned Motion Retargeting with Dense Geometric Interaction Perception
NIPS 2024
Exocentric-to-Egocentric Video Generation
NIPS 2024
VideoLLM-online: Online Video Large Language Model for Streaming Video
CVPR 2024
L4D-Track: Language-to-4D Modeling Towards 6-DoF Tracking and Shape Reconstruction in 3D Point Cloud Stream
CVPR 2024
Rethinking the Objectives of Vector-Quantized Tokenizers for Image Synthesis
CVPR 2024
Bootstrapping SparseFormers from Vision Foundation Models
CVPR 2024
X-Adapter: Adding Universal Compatibility of Plugins for Upgraded Diffusion Model
CVPR 2024
VideoSwap: Customized Video Subject Swapping with Interactive Semantic Point Correspondence
CVPR 2024
Tune-An-Ellipse: CLIP Has Potential to Find What You Want
CVPR 2024
Ego-Exo4D: Understanding Skilled Human Activity from First- and Third-Person Perspectives
CVPR 2024
MagicAnimate: Temporally Consistent Human Image Animation using Diffusion Model
CVPR 2024
ViT-Lens: Towards Omni-modal Representations
CVPR 2024
AssistGUI: Task-Oriented PC Graphical User Interface Automation
CVPR 2024
DynVideo-E: Harnessing Dynamic NeRF for Large-Scale Motion- and View-Change Human-Centric Video Editing
CVPR 2024
Drag Anything: Motion Control for Anything using Entity Representation
ECCV 2024
GENIXER: Empowering Multimodal Large Language Models as a Powerful Data Generator
ECCV 2024
Rethinking Tree-Ring Watermarking for Enhanced Multi-Key Identification
ECCV 2024
Parrot Captions Teach CLIP to Spot Text
ECCV 2024
Learning Video Context as Interleaved Multimodal Sequences
ECCV 2024
MotionDirector: Motion Customization of Text-to-Video Diffusion Models
ECCV 2024
SparseFormer: Sparse Visual Recognition via Limited Latent Tokens
ICLR 2024
Apprenticeship-Inspired Elegance: Synergistic Knowledge Distillation Empowers Spiking Neural Networks for Efficient Single-Eye Emotion Recognition
IJCAI 2024
Delocate: Detection and Localization for Deepfake Videos with Randomly-Located Tampered Traces
IJCAI 2024
CONE: An Efficient COarse-to-fiNE Alignment Framework for Long Video Temporal Grounding
ACL 2023
Darwinian Model Upgrades: Model Evolving with Selective Compatibility
AAAI 2023
PV3D: A 3D Generative Model for Portrait Video Generation
ICLR 2023
XAGen: 3D Expressive Human Avatars Generation
NIPS 2023
Video-Text Pre-training with Learned Regions for Retrieval
AAAI 2023
Towards Fast Adaptation of Pretrained Contrastive Models for Multi-Channel Video-Language Retrieval
CVPR 2023
Mix-of-Show: Decentralized Low-Rank Adaptation for Multi-Concept Customization of Diffusion Models
NIPS 2023
Object-centric Learning with Cyclic Walks between Parts and Whole
NIPS 2023
Position-Guided Text Prompt for Vision-Language Pre-Training
CVPR 2023
Learning Visual Prior via Generative Pre-Training
NIPS 2023
DatasetDM: Synthesizing Data with Perception Annotations Using Diffusion Models
NIPS 2023
All in One: Exploring Unified Video-Language Pre-Training
CVPR 2023
Making Vision Transformers Efficient From a Token Sparsification View
CVPR 2023
Affordance Grounding From Demonstration Video To Target Image
CVPR 2023
DiffuMask: Synthesizing Images with Pixel-level Annotations for Semantic Segmentation Using Diffusion Models
ICCV 2023
Too Large; Data Reduction for Vision-Language Pre-Training
ICCV 2023
STPrivacy: Spatio-Temporal Privacy-Preserving Action Recognition
ICCV 2023
Unsupervised Open-Vocabulary Object Localization in Videos
ICCV 2023
Tune-A-Video: One-Shot Tuning of Image Diffusion Models for Text-to-Video Generation
ICCV 2023
Revisiting Vision Transformer from the View of Path Ensemble
ICCV 2023
Learning to Learn: How to Continuously Teach Humans and Machines
ICCV 2023
UniVTG: Towards Unified Video-Language Temporal Grounding
ICCV 2023
EgoVLPv2: Egocentric Video-Language Pre-training with Fusion in the Backbone
ICCV 2023
BoxDiff: Text-to-Image Synthesis with Training-Free Box-Constrained Diffusion
ICCV 2023
HOSNeRF: Dynamic Human-Object-Scene Neural Radiance Fields from a Single Video
ICCV 2023
Label-Efficient Online Continual Object Detection in Streaming Video
ICCV 2023
Symbolic Replay: Scene Graph as Prompt for Continual Learning on VQA Task
AAAI 2023
MIST: Multi-Modal Iterative Spatial-Temporal Transformer for Long-Form Video Question Answering
CVPR 2023
GEB+: A Benchmark for Generic Event Boundary Captioning, Grounding and Retrieval
ECCV 2022
Egocentric Video-Language Pretraining
NIPS 2022
AssistSR: Task-oriented Video Segment Retrieval for Personal AI Assistant
EMNLP 2022
Unified Transformer Tracker for Object Tracking
CVPR 2022
Object-Aware Video-Language Pre-Training for Retrieval
CVPR 2022
DeVRF: Fast Deformable Voxel Radiance Fields for Dynamic Scenes
NIPS 2022
Ego4D: Around the World in 3,000 Hours of Egocentric Video
CVPR 2022
MorphMLP: An Efficient MLP-Like Backbone for Spatial-Temporal Representation Learning
ECCV 2022
AssistQ: Affordance-Centric Question-Driven Task Completion for Egocentric Assistant
ECCV 2022
Channel Augmented Joint Learning for Visible-Infrared Recognition
ICCV 2021
On Pursuit of Designing Multi-modal Transformer for Video Grounding
EMNLP 2021
Actor-Context-Actor Relation Network for Spatio-Temporal Action Localization
CVPR 2021
Generic Event Boundary Detection: A Benchmark for Event Segmentation
ICCV 2021
Searching for Two-Stream Models in Multivariate Space for Video Recognition
ICCV 2021