Ying Shan

150 papers · 2020–2026 · 11 conferences · across top CS/AI conferences

Achievements


🗺️ Taxonomy Completionist (14) 🧭 Keyword Pioneer 🌉 Interdisciplinary Bridge 🌈 Renaissance Researcher (5) 🌍 Conference Polyglot (11)

🏠 Conference Loyalist (22) 🏆 Grand Slam 👑 Triple Crown 🤝 Dynamic Duo (43) 🔬 Deep Specialist (37) 🧬 Topic Evolution 🏆 Keyword Champion (4) ❓ The Questioner (2) 🗃️ Keyword Collector (559) 💎 Century Club (149) 🔥 Unstoppable (6) ⚡ Prolific Year (49)

Conferences

CVPR (55) ICCV (22) ECCV (17) NIPS (16) AAAI (14) ICLR (9) ACL (5) ICML (5) IJCAI (3) INTERSPEECH (3) NAACL (1)

Top co-authors

Yixiao Ge (43) Xintao Wang (37) Zhongang Qi (22) Yong Zhang (20) Yan-Pei Cao (18) Xiaohu Qie (17) Xiaodong Cun (16) Xiaoyu Li (11) Yuying Ge (10) Mike Zheng Shou (10)

Research topics

Privacy (1)

Keywords

diffusion model (26) neural radiance field (13) novel view synthesis (12) 3d reconstruction (11) multimodal learning (10) video generation (9) image generation (7) multi-modal learning (7) object detection (7) representation learning (6) video understanding (6) text-to-image generation (5) zero-shot learning (5) neural network (5) transfer learning (5) vision transformer (5) multimodal large language model (5) image synthesis (4) depth estimation (4) generative adversarial network (4)

Papers

MMhops-R1: Multimodal Multi-hop Reasoning AAAI 2026
AnimeGamer: Infinite Anime Life Simulation with Next Game State Prediction ICCV 2025
Image Conductor: Precision Control for Interactive Video Synthesis AAAI 2025
CustomCrafter: Customized Video Generation with Preserving Motion and Concept Composition Abilities AAAI 2025
GeometryCrafter: Consistent Geometry Estimation for Open-world Videos with Diffusion Priors ICCV 2025
GenHancer: Imperfect Generative Models are Secretly Strong Vision-Centric Enhancers ICCV 2025
Taming Rectified Flow for Inversion and Editing ICML 2025
Plot2Code: A Comprehensive Benchmark for Evaluating Multi-modal Large Language Models in Code Generation from Scientific Plots NAACL 2025
FreeSplatter: Pose-free Gaussian Splatting for Sparse-view 3D Reconstruction ICCV 2025
HaploVL: A Single-Transformer Baseline for Multi-Modal Understanding ICML 2025
LoRA-Gen: Specializing Large Language Model via Online LoRA Generation ICML 2025
Moto: Latent Motion Token as the Bridging Language for Learning Robot Manipulation from Videos ICCV 2025
DI-PCG: Diffusion-based Efficient Inverse Procedural Content Generation for High-quality 3D Asset Creation CVPR 2025
Mani-GS: Gaussian Splatting Manipulation with Triangular Mesh CVPR 2025
Divot: Diffusion Powers Video Tokenizer for Comprehension and Generation CVPR 2025
DiTCtrl: Exploring Attention Control in Multi-Modal Diffusion Transformer for Tuning-Free Multi-Prompt Longer Video Generation CVPR 2025
NVComposer: Boosting Generative Novel View Synthesis with Multiple Sparse and Unposed Images CVPR 2025
VisionMath: Vision-Form Mathematical Problem-Solving ICCV 2025
Mono2Stereo: A Benchmark and Empirical Study for Stereo Conversion CVPR 2025
DepthCrafter: Generating Consistent Long Depth Sequences for Open-world Videos CVPR 2025
DepthSync: Diffusion Guidance-Based Depth Synchronization for Scale- and Geometry-Consistent Video Depth Estimation ICCV 2025
Scalable Image Tokenization with Index Backpropagation Quantization ICCV 2025
Mamba-3VL: Taming State Space Model for 3D Vision Language Learning ICCV 2025
TrajectoryCrafter: Redirecting Camera Trajectory for Monocular Videos via Diffusion Models ICCV 2025
AFL-Net: Integrating Audio, Facial, and Lip Modalities with a Two-step Cross-attention for Robust Speaker Diarization in the Wild INTERSPEECH 2024
CV-VAE: A Compatible Video VAE for Latent Generative Video Models NIPS 2024
ReVideo: Remake a Video with Motion and Content Control NIPS 2024
SparseGNV: Generating Novel Views of Indoor Scenes with Sparse RGB-D Images AAAI 2024
SC-NeuS: Consistent Neural Surface Reconstruction from Sparse and Noisy Views AAAI 2024
T2I-Adapter: Learning Adapters to Dig Out More Controllable Ability for Text-to-Image Diffusion Models AAAI 2024
SphereDiffusion: Spherical Geometry-Aware Distortion Resilient Diffusion Model AAAI 2024
A Pre-convolved Representation for Plug-and-Play Neural Illumination Fields AAAI 2024
Sparse3D: Distilling Multiview-Consistent Diffusion for Object Reconstruction from Sparse Views AAAI 2024
E.T. Bench: Towards Open-Ended Event-Level Video-Language Understanding NIPS 2024
MiraData: A Large-Scale Video Dataset with Long Durations and Structured Captions NIPS 2024
MambaTree: Tree Topology is All You Need in State Space Model NIPS 2024
LLaMA Pro: Progressive LLaMA with Block Expansion ACL 2024
Programmable Motion Generation for Open-Set Motion Control Tasks CVPR 2024
ConTex-Human: Free-View Rendering of Human from a Single Image with Texture-Consistent Synthesis CVPR 2024
HumanGaussian: Text-Driven 3D Human Generation with Gaussian Splatting CVPR 2024
Rethinking the Objectives of Vector-Quantized Tokenizers for Image Synthesis CVPR 2024
PhotoMaker: Customizing Realistic Human Photos via Stacked ID Embedding CVPR 2024
SmartEdit: Exploring Complex Instruction-based Image Editing with Multimodal Large Language Models CVPR 2024
Low-Rank Approximation for Sparse Attention in Multi-Modal LLMs CVPR 2024
DreamAvatar: Text-and-Shape Guided 3D Human Avatar Generation via Diffusion Models CVPR 2024
BT-Adapter: Video Conversation is Feasible Without Video Instruction Tuning CVPR 2024
HumanRef: Single Image to 3D Human Generation via Reference-Guided Diffusion CVPR 2024
DiffEditor: Boosting Accuracy and Flexibility on Diffusion-based Image Editing CVPR 2024
Multimodal Pathway: Improve Transformers with Irrelevant Data from Other Modalities CVPR 2024
ViT-Lens: Towards Omni-modal Representations CVPR 2024
YOLO-World: Real-Time Open-Vocabulary Object Detection CVPR 2024
EvalCrafter: Benchmarking and Evaluating Large Video Generation Models CVPR 2024
DynVideo-E: Harnessing Dynamic NeRF for Large-Scale Motion- and View-Change Human-Centric Video Editing CVPR 2024
VideoCrafter2: Overcoming Data Limitations for High-Quality Video Diffusion Models CVPR 2024
GS-IR: 3D Gaussian Splatting for Inverse Rendering CVPR 2024
UniRepLKNet: A Universal Perception Large-Kernel ConvNet for Audio Video Point Cloud Time-Series and Image Recognition CVPR 2024
SEED-Bench: Benchmarking Multimodal Large Language Models CVPR 2024
How to Make Cross Encoder a Good Teacher for Efficient Image-Text Retrieval? CVPR 2024
MOFA-Video: Controllable Image Animation via Generative Motion Field Adaptions in Frozen Image-to-Video Diffusion Model ECCV 2024
BrushNet: A Plug-and-Play Image Inpainting Model with Decomposed Dual-Branch Diffusion ECCV 2024
Texture-GS: Disentangle the Geometry and Texture for 3D Gaussian Splatting Editing ECCV 2024
DreamDiffusion: High-Quality EEG-to-Image Generation with Temporal Masked Signal Modeling and CLIP Alignment ECCV 2024
Make a Cheap Scaling: A Self-Cascade Diffusion Model for Higher-Resolution Adaptation ECCV 2024
Noise Calibration: Plug-and-play Content-Preserving Video Enhancement using Pre-trained Video Diffusion Models ECCV 2024
DynamiCrafter: Animating Open-domain Images with Video Diffusion Priors ECCV 2024
EA-VTR: Event-Aware Video-Text Retrieval ECCV 2024
DMiT: Deformable Mipmapped Tri-Plane Representation for Dynamic Scenes ECCV 2024
ST-LLM: Large Language Models Are Effective Temporal Learners ECCV 2024
HiFi-123: Towards High-fidelity One Image to 3D Content Generation ECCV 2024
FreeNoise: Tuning-Free Longer Video Diffusion via Noise Rescheduling ICLR 2024
DragonDiffusion: Enabling Drag-style Manipulation on Diffusion Models ICLR 2024
TapMo: Shape-aware Motion Generation of Skeleton-free Characters ICLR 2024
Making LLaMA SEE and Draw with SEED Tokenizer ICLR 2024
ScaleCrafter: Tuning-free Higher-Resolution Visual Generation with Diffusion Models ICLR 2024
Masked Image Modeling with Denoising Contrast ICLR 2023
PanoGRF: Generalizable Spherical Radiance Fields for Wide-baseline Panoramas NIPS 2023
Mix-of-Show: Decentralized Low-Rank Adaptation for Multi-Concept Customization of Diffusion Models NIPS 2023
CL-NeRF: Continual Learning of Neural Radiance Fields for Evolving Scene Representation NIPS 2023
Exploiting Contextual Objects and Relations for 3D Visual Grounding NIPS 2023
Meta-Adapter: An Online Few-shot Learner for Vision-Language Model NIPS 2023
GPT4Tools: Teaching Large Language Model to Use Tools via Self-instruction NIPS 2023
Inserting Anybody in Diffusion Models via Celeb Basis NIPS 2023
Tagging before Alignment: Integrating Multi-Modal Tags for Video-Text Retrieval AAAI 2023
Accelerating the Training of Video Super-resolution Models AAAI 2023
Mitigating Artifacts in Real-World Video Super-resolution Models AAAI 2023
Darwinian Model Upgrades: Model Evolving with Selective Compatibility AAAI 2023
What Does Your Face Sound Like? 3D Face Shape towards Voice AAAI 2023
DSRM: Boost Textual Adversarial Training with Distribution Shift Risk Minimization ACL 2023
A Confidence-based Partial Label Learning Model for Crowd-Annotated Named Entity Recognition ACL 2023
Characterizing the Impacts of Instances on Robustness ACL 2023
On the Universal Adversarial Perturbations for Efficient Data-free Adversarial Detection ACL 2023
Accelerating Vision-Language Pretraining With Free Language Modeling CVPR 2023
3D GAN Inversion With Facial Symmetry Prior CVPR 2023
Generating Human Motion From Textual Descriptions With Discrete Representations CVPR 2023
DPE: Disentanglement of Pose and Expression for General Video Portrait Editing CVPR 2023
DropMAE: Masked Autoencoders With Spatial-Attention Dropout for Tracking Tasks CVPR 2023
Improved Test-Time Adaptation for Domain Generalization CVPR 2023
HRDFuse: Monocular 360° Depth Estimation by Collaboratively Learning Holistic-With-Regional Depth Distributions CVPR 2023
High-Fidelity Facial Avatar Reconstruction From Monocular Video With Generative Priors CVPR 2023
All in One: Exploring Unified Video-Language Pre-Training CVPR 2023
SadTalker: Learning Realistic 3D Motion Coefficients for Stylized Audio-Driven Single Image Talking Face Animation CVPR 2023
Local-to-Global Registration for Bundle-Adjusting Neural Radiance Fields CVPR 2023
LayoutDiffusion: Controllable Diffusion Model for Layout-to-Image Generation CVPR 2023
OSRT: Omnidirectional Image Super-Resolution With Distortion-Aware Transformer CVPR 2023
Learning Anchor Transformations for 3D Garment Animation CVPR 2023
ViLEM: Visual-Language Error Modeling for Image-Text Retrieval CVPR 2023
RILS: Masked Visual Reconstruction in Language Semantic Space CVPR 2023
SurfelNeRF: Neural Surfel Radiance Fields for Online Photorealistic Reconstruction of Indoor Scenes CVPR 2023
Skinned Motion Retargeting With Residual Perception of Motion Semantics & Geometry CVPR 2023
Dream3D: Zero-Shot Text-to-3D Synthesis Using 3D Shape Prior and Text-to-Image Diffusion Models CVPR 2023
Unleashing Vanilla Vision Transformer with Masked Image Modeling for Object Detection ICCV 2023
Tune-A-Video: One-Shot Tuning of Image Diffusion Models for Text-to-Video Generation ICCV 2023
Order-Prompted Tag Sequence Generation for Video Tagging ICCV 2023
MasaCtrl: Tuning-Free Mutual Self-Attention Control for Consistent Image Synthesis and Editing ICCV 2023
FateZero: Fusing Attentions for Zero-shot Text-based Video Editing ICCV 2023
Speech2Lip: High-fidelity Speech to Lip Generation by Learning from a Short Video ICCV 2023
HOSNeRF: Dynamic Human-Object-Scene Neural Radiance Fields from a Single Video ICCV 2023
OmniZoomer: Learning to Move and Zoom in on Sphere at High-Resolution ICCV 2023
Exploring Model Transferability through the Lens of Potential Energy ICCV 2023
π-Tuning: Transferring Multimodal Foundation Models with Optimal Multi-task Interpolation ICML 2023
DeSRA: Detect and Delete the Artifacts of GAN-based Real-World Super-Resolution Models ICML 2023
SGAT4PASS: Spherical Geometry-Aware Transformer for PAnoramic Semantic Segmentation IJCAI 2023
Prosody Modeling with 3D Visual Information for Expressive Video Dubbing INTERSPEECH 2023
MILES: Visual BERT Pre-training with Injected Language Semantics for Video-Text Retrieval ECCV 2022
Mc-BEiT: Multi-Choice Discretization for Image BERT Pre-training ECCV 2022
VQFR: Blind Face Restoration with Vector-Quantized Dictionary and Parallel Decoder ECCV 2022
Metric Learning Based Interactive Modulation for Real-World Super-Resolution ECCV 2022
UMT: Unified Multi-Modal Transformers for Joint Video Moment Retrieval and Highlight Detection CVPR 2022
Temporally Efficient Vision Transformer for Video Instance Segmentation CVPR 2022
BTS: A Bi-Lingual Benchmark for Text Segmentation in the Wild CVPR 2022
Object-Aware Video-Language Pre-Training for Retrieval CVPR 2022
Bridging Video-Text Retrieval With Multiple Choice Questions CVPR 2022
DeVRF: Fast Deformable Voxel Radiance Fields for Dynamic Scenes NIPS 2022
Towards Universal Backward-Compatible Representation Learning IJCAI 2022
AnimeSR: Learning Real-World Super-Resolution Models for Animation Videos NIPS 2022
A Hierarchical Speaker Representation Framework for One-shot Singing Voice Conversion INTERSPEECH 2022
Not All Models Are Equal: Predicting Model Transferability in a Self-Challenging Fisher Space ECCV 2022
Dynamic Token Normalization improves Vision Transformers ICLR 2022
Uncertainty Modeling for Out-of-Distribution Generalization ICLR 2022
Hot-Refresh Model Upgrades with Regression-Free Compatible Training in Image Retrieval ICLR 2022
Crossover Learning for Fast Online Video Instance Segmentation ICCV 2021
Finding Discriminative Filters for Specific Degradations in Blind Super-Resolution NIPS 2021
Instances As Queries ICCV 2021
Towards Real-World Blind Face Restoration With Generative Facial Prior CVPR 2021
Open-Book Video Captioning With Retrieve-Copy-Generate Network CVPR 2021
Distilling Audio-Visual Knowledge by Compositional Contrastive Learning CVPR 2021
Towards Vivid and Diverse Image Colorization With Generative Color Prior ICCV 2021
Detecting Interactions from Neural Networks via Topological Analysis NIPS 2020
Feature Augmented Memory with Global Attention Network for VideoQA IJCAI 2020
Fast Video Object Segmentation using the Global Context Module ECCV 2020