Zhengyuan Yang
50 papers · 2019–2026 · 12 top CS/AI conferences
Achievements
Conference Polyglot (12)
Academic Marathon (7)
Interdisciplinary Bridge
Keyword Pioneer
Cross-Pollinator (14)
Renaissance Researcher (6)
Taxonomy Completionist (70)
Deep Specialist (15)
Dynamic Duo (37)
Grand Slam
Topic Evolution
Century Club (48)
Keyword Collector (161)
Unstoppable (8)
The Questioner (2)
Prolific Year (14)
Conferences
CVPR (11)
ICCV (8)
ICLR (6)
ECCV (5)
ICML (4)
AAAI (3)
ACL (3)
NIPS (3)
EMNLP (2)
IJCAI (2)
WACV (2)
NAACL (1)
Keywords
multimodal learning (12)
diffusion model (7)
large language model (5)
image captioning (4)
image generation (4)
video generation (4)
text-to-image generation (4)
vision-language model (4)
visual question answering (4)
zero-shot learning (3)
visual grounding (3)
multi-modal learning (3)
in-context learning (3)
object detection (3)
graph neural network (3)
video understanding (2)
metric learning (2)
scene graph generation (2)
benchmark evaluation (2)
transfer learning (2)
Papers
Conditional Text-to-Image Generation with Reference Guidance
WACV 2026
Shanks: Simultaneous Hearing and Thinking for Spoken Language Models
ACL 2026
TextGround4M: A Prompt-Aligned Dataset for Layout-Aware Text Rendering
AAAI 2026
Zero-Shot Audio-Visual Editing via Cross-Modal Delta Denoising
WACV 2026
GLIMPSE: Do Large Vision-Language Models Truly Think With Videos or Just Glimpse at Them?
EMNLP 2025
Audio-Aware Large Language Models as Judges for Speaking Styles
EMNLP 2025
MMWorld: Towards Multi-discipline Multi-faceted World Model Evaluation in Videos
ICLR 2025
EditRoom: LLM-parameterized Graph Diffusion for Composable 3D Room Layout Editing
ICLR 2025
ImageGen-CoT: Enhancing Text-to-Image In-context Learning with Chain-of-Thought Reasoning
ICCV 2025
SITE: towards Spatial Intelligence Thorough Evaluation
ICCV 2025
Scaling Inference-Time Search with Vision Value Model for Improved Visual Comprehension
ICCV 2025
SlowFast-VGen: Slow-Fast Learning for Action-Driven Long Video Generation
ICLR 2025
LiVOS: Light Video Object Segmentation with Gated Linear Matching
CVPR 2025
ShowUI: One Vision-Language-Action Model for GUI Visual Agent
CVPR 2025
Tuning Timestep-Distilled Diffusion Model Using Pairwise Sample Optimization
ICLR 2025
GenXD: Generating Any 3D and 4D Scenes
ICLR 2025
Design2Code: Benchmarking Multimodal Code Generation for Automated Front-End Engineering
NAACL 2025
Can MLLMs Reason in Multimodality? EMMA: An Enhanced MultiModal ReAsoning Benchmark
ICML 2025
ReFocus: Visual Editing as a Chain of Thought for Structured Image Understanding
ICML 2025
Interfacing Foundation Models' Embeddings
NIPS 2024
VideoGUI: A Benchmark for GUI Automation from Instructional Videos
NIPS 2024
Motion Consistency Model: Accelerating Video Diffusion with Disentangled Motion-Appearance Distillation
NIPS 2024
SGFormer: Semantic Graph Transformer for Point Cloud-Based 3D Scene Graph Generation
AAAI 2024
MM-Narrator: Narrating Long-form Videos with Multimodal In-Context Learning
CVPR 2024
Training Diffusion Models Towards Diverse Image Generation with Reinforcement Learning
CVPR 2024
DisCo: Disentangled Control for Realistic Human Dance Generation
CVPR 2024
MMSum: A Dataset for Multimodal Summarization and Thumbnail Generation of Videos
CVPR 2024
IDOL: Unified Dual-Modal Latent Diffusion for Human-Centric Joint Video-Depth Generation
ECCV 2024
Idea2Img: Iterative Self-Refinement with GPT-4V for Automatic Image Design and Generation
ECCV 2024
GRiT: A Generative Region-to-text Transformer for Object Understanding
ECCV 2024
StrokeNUWA: Tokenizing Strokes for Vector Graphic Synthesis
ICML 2024
MM-Vet: Evaluating Large Multimodal Models for Integrated Capabilities
ICML 2024
Bring Metric Functions into Diffusion Models
IJCAI 2024
NUWA-XL: Diffusion over Diffusion for eXtremely Long Video Generation
ACL 2023
Prompting GPT-3 To Be Reliable
ICLR 2023
Equivariant Similarity for Vision-Language Foundation Models
ICCV 2023
PromptCap: Prompt-Guided Image Captioning for VQA with GPT-3
ICCV 2023
Learning 3D Photography Videos via Self-supervised Diffusion on Single Images
IJCAI 2023
ReCo: Region-Controlled Text-to-Image Generation
CVPR 2023
Scaling Up Vision-Language Pre-Training for Image Captioning
CVPR 2022
UniTAB: Unifying Text and Box Outputs for Grounded Vision-Language Modeling
ECCV 2022
An Empirical Study of GPT-3 for Few-Shot Knowledge-Based VQA
AAAI 2022
Improving Weakly Supervised Visual Grounding by Contrastive Knowledge Distillation
CVPR 2021
SAT: 2D Semantics Assisted Training for 3D Visual Grounding
ICCV 2021
TransVG: End-to-End Visual Grounding With Transformers
ICCV 2021
TAP: Text-Aware Pre-Training for Text-VQA and Text-Caption
CVPR 2021
A Novel Graph-based Multi-modal Fusion Encoder for Neural Machine Translation
ACL 2020
Improving One-stage Visual Grounding by Recursive Sub-query Construction
ECCV 2020
A Fast and Accurate One-Stage Approach to Visual Grounding
ICCV 2019
Attentive Relational Networks for Mapping Images to Scene Graphs
CVPR 2019