Zhengyuan Yang

50 papers · 2019–2026 · 12 conferences · across top CS/AI conferences

Achievements

🌍 Conference Polyglot (12) 🏃 Academic Marathon (7) 🌉 Interdisciplinary Bridge 🧭 Keyword Pioneer 🐝 Cross-Pollinator (14)

🌈 Renaissance Researcher (6) 🗺️ Taxonomy Completionist (70) 🔬 Deep Specialist (15) 🤝 Dynamic Duo (37) 🏆 Grand Slam 🧬 Topic Evolution 💎 Century Club (48) 🗃️ Keyword Collector (161) 🔥 Unstoppable (8) ❓ The Questioner (2) ⚡ Prolific Year (14)

Conferences

CVPR (11) ICCV (8) ICLR (6) ECCV (5) ICML (4) AAAI (3) ACL (3) NIPS (3) EMNLP (2) IJCAI (2) WACV (2) NAACL (1)

Top co-authors

Lijuan Wang (38) Linjie Li (28) Jianfeng Wang (23) Kevin Lin (19) Zicheng Liu (18) Chung-Ching Lin (12) Jiebo Luo (8) Zhe Gan (6) Liwei Wang (4) Nan Duan (4)

Keywords

multimodal learning (12) diffusion model (7) large language model (5) image captioning (4) image generation (4) video generation (4) text-to-image generation (4) vision-language model (4) visual question answering (4) zero-shot learning (3) visual grounding (3) multi-modal learning (3) in-context learning (3) object detection (3) graph neural network (3) video understanding (2) metric learning (2) scene graph generation (2) benchmark evaluation (2) transfer learning (2)

Papers

Conditional Text-to-Image Generation with Reference Guidance · WACV 2026
Shanks: Simultaneous Hearing and Thinking for Spoken Language Models · ACL 2026
TextGround4M: A Prompt-Aligned Dataset for Layout-Aware Text Rendering · AAAI 2026
Zero-Shot Audio-Visual Editing via Cross-Modal Delta Denoising · WACV 2026
GLIMPSE: Do Large Vision-Language Models Truly Think With Videos or Just Glimpse at Them? · EMNLP 2025
Audio-Aware Large Language Models as Judges for Speaking Styles · EMNLP 2025
MMWorld: Towards Multi-discipline Multi-faceted World Model Evaluation in Videos · ICLR 2025
EditRoom: LLM-parameterized Graph Diffusion for Composable 3D Room Layout Editing · ICLR 2025
ImageGen-CoT: Enhancing Text-to-Image In-context Learning with Chain-of-Thought Reasoning · ICCV 2025
SITE: towards Spatial Intelligence Thorough Evaluation · ICCV 2025
Scaling Inference-Time Search with Vision Value Model for Improved Visual Comprehension · ICCV 2025
SlowFast-VGen: Slow-Fast Learning for Action-Driven Long Video Generation · ICLR 2025
LiVOS: Light Video Object Segmentation with Gated Linear Matching · CVPR 2025
ShowUI: One Vision-Language-Action Model for GUI Visual Agent · CVPR 2025
Tuning Timestep-Distilled Diffusion Model Using Pairwise Sample Optimization · ICLR 2025
GenXD: Generating Any 3D and 4D Scenes · ICLR 2025
Design2Code: Benchmarking Multimodal Code Generation for Automated Front-End Engineering · NAACL 2025
Can MLLMs Reason in Multimodality? EMMA: An Enhanced MultiModal ReAsoning Benchmark · ICML 2025
ReFocus: Visual Editing as a Chain of Thought for Structured Image Understanding · ICML 2025
Interfacing Foundation Models' Embeddings · NIPS 2024
VideoGUI: A Benchmark for GUI Automation from Instructional Videos · NIPS 2024
Motion Consistency Model: Accelerating Video Diffusion with Disentangled Motion-Appearance Distillation · NIPS 2024
SGFormer: Semantic Graph Transformer for Point Cloud-Based 3D Scene Graph Generation · AAAI 2024
MM-Narrator: Narrating Long-form Videos with Multimodal In-Context Learning · CVPR 2024
Training Diffusion Models Towards Diverse Image Generation with Reinforcement Learning · CVPR 2024
DisCo: Disentangled Control for Realistic Human Dance Generation · CVPR 2024
MMSum: A Dataset for Multimodal Summarization and Thumbnail Generation of Videos · CVPR 2024
IDOL: Unified Dual-Modal Latent Diffusion for Human-Centric Joint Video-Depth Generation · ECCV 2024
Idea2Img: Iterative Self-Refinement with GPT-4V for Automatic Image Design and Generation · ECCV 2024
GRiT: A Generative Region-to-text Transformer for Object Understanding · ECCV 2024
StrokeNUWA: Tokenizing Strokes for Vector Graphic Synthesis · ICML 2024
MM-Vet: Evaluating Large Multimodal Models for Integrated Capabilities · ICML 2024
Bring Metric Functions into Diffusion Models · IJCAI 2024
NUWA-XL: Diffusion over Diffusion for eXtremely Long Video Generation · ACL 2023
Prompting GPT-3 To Be Reliable · ICLR 2023
Equivariant Similarity for Vision-Language Foundation Models · ICCV 2023
PromptCap: Prompt-Guided Image Captioning for VQA with GPT-3 · ICCV 2023
Learning 3D Photography Videos via Self-supervised Diffusion on Single Images · IJCAI 2023
ReCo: Region-Controlled Text-to-Image Generation · CVPR 2023
Scaling Up Vision-Language Pre-Training for Image Captioning · CVPR 2022
UniTAB: Unifying Text and Box Outputs for Grounded Vision-Language Modeling · ECCV 2022
An Empirical Study of GPT-3 for Few-Shot Knowledge-Based VQA · AAAI 2022
Improving Weakly Supervised Visual Grounding by Contrastive Knowledge Distillation · CVPR 2021
SAT: 2D Semantics Assisted Training for 3D Visual Grounding · ICCV 2021
TransVG: End-to-End Visual Grounding With Transformers · ICCV 2021
TAP: Text-Aware Pre-Training for Text-VQA and Text-Caption · CVPR 2021
A Novel Graph-based Multi-modal Fusion Encoder for Neural Machine Translation · ACL 2020
Improving One-stage Visual Grounding by Recursive Sub-query Construction · ECCV 2020
A Fast and Accurate One-Stage Approach to Visual Grounding · ICCV 2019
Attentive Relational Networks for Mapping Images to Scene Graphs · CVPR 2019