Di Zhang

59 papers · 2023–2026 · 12 conferences across top CS/AI venues

Achievements

🌍 Conference Polyglot (12) 🐝 Cross-Pollinator (7) 🧭 Keyword Pioneer 🌉 Interdisciplinary Bridge 🌈 Renaissance Researcher (7) 🏆 Grand Slam 👑 Triple Crown 🤝 Dynamic Duo (20) 🔬 Deep Specialist (11) ⚡ Prolific Year (16) 🚀 Conference Pioneer 🗃️ Keyword Collector (244) 💎 Century Club (54)

Conferences

CVPR (10) ACL (8) EMNLP (8) ICCV (8) ICLR (7) AAAI (4) ICML (4) COLING (3) NIPS (3) NAACL (2) CORL (1) MICCAI (1)

Top co-authors

Pengfei Wan (22) Kun Gai (14) Fuzheng Zhang (12) Xintao Wang (9) Xin Tao (8) Tingting Gao (8) Fan Yang (6) Dongzhan Zhou (4) Chengru Song (4) Menghan Xia (4)

Keywords

large language model (11) video generation (10) diffusion model (9) instruction tuning (5) reinforcement learning (4) multimodal large language model (3) text-to-video generation (3) preference optimization (3) diffusion transformer (3) image generation (3) text-to-image generation (3) video diffusion (2) benchmark evaluation (2) language model (2) temporal coherence (2) gaussian splatting (2) video captioning (2) instruction following (2) prompt engineering (2) video understanding (2)

Papers

FilmWeaver: Weaving Consistent Multi-Shot Videos with Cache-Guided Autoregressive Diffusion · AAAI 2026
Boosting Resolution Generalization of Diffusion Transformers with Randomized Positional Encodings · AAAI 2026
From Detection to Understanding: Multi-Turn Reasoning for Video Misinformation Analysis · ACL 2026
Towards Stable and Effective Reinforcement Learning for Mixture-of-Experts · ACL 2026
TIME: Temporal-Sensitive Multi-Dimensional Instruction Tuning and Robust Benchmarking for Video-LLMs · AAAI 2026
iMOVE : Instance-Motion-Aware Video Understanding · ACL 2025
Stable Segment Anything Model · ICLR 2025
KineDex: Learning Tactile-Informed Visuomotor Policies via Kinesthetic Teaching for Dexterous Manipulation · CORL 2025
ChemVLM: Exploring the Power of Multimodal Large Language Models in Chemistry Area · AAAI 2025
HAIC: Improving Human Action Understanding and Generation with Better Captions for Multi-modal Large Language Models · ACL 2025
VidCapBench: A Comprehensive Benchmark of Video Captioning for Controllable Text-to-Video Generation · ACL 2025
How Far are AI-generated Videos from Simulating the 3D Visual World: A Learned 3D Evaluation Approach · ICCV 2025
ReCamMaster: Camera-Controlled Generative Rendering from A Single Video · ICCV 2025
GameFactory: Creating New Games with Generative Interactive Videos · ICCV 2025
Imbalance in Balance: Online Concept Balancing in Generation Models · ICCV 2025
GGTalker: Talking Head Systhesis with Generalizable Gaussian Priors and Identity-Specific Adaptation · ICCV 2025
FullDiT: Video Generative Foundation Models with Multimodal Control via Full Attention · ICCV 2025
MUSE: Multi-Subject Unified Synthesis via Explicit Layout Semantic Expansion · ICCV 2025
Scene Graph Guided Generation: Enable Accurate Relations Generation in Text-to-Image Models via Textural Rectification · ICCV 2025
Solving Token Gradient Conflict in Mixture-of-Experts for Large Vision-Language Model · ICLR 2025
TaskGalaxy: Scaling Multi-modal Instruction Fine-tuning with Tens of Thousands Vision Task Types · ICLR 2025
Cafe-Talk: Generating 3D Talking Face Animation with Multimodal Coarse- and Fine-grained Control · ICLR 2025
SynCamMaster: Synchronizing Multi-Camera Video Generation from Diverse Viewpoints · ICLR 2025
3DTrajMaster: Mastering 3D Trajectory for Multi-Entity Motion in Video Generation · ICLR 2025
MODA: MOdular Duplex Attention for Multimodal Perception, Cognition, and Emotion Understanding · ICML 2025
MM-RLHF: The Next Step Forward in Multimodal LLM Alignment · ICML 2025
CERTAIN: Context Uncertainty-aware One-Shot Adaptation for Context-based Offline Meta Reinforcement Learning · ICML 2025
LLaMA-Berry: Pairwise Optimization for Olympiad-level Mathematical Reasoning via O1-like Monte Carlo Tree Search · NAACL 2025
Chain-of-Specificity: Enhancing Task-Specific Constraint Adherence in Large Language Models · COLING 2025
Breaking the Stage Barrier: A Novel Single-Stage Approach to Long Context Extension for Large Language Models · COLING 2025
SketchVideo: Sketch-based Video Generation and Editing · CVPR 2025
StyleMaster: Stylize Your Video with Artistic Generation and Translation · CVPR 2025
Koala-36M: A Large-scale Video Dataset Improving Consistency between Fine-grained Conditions and Video Content · CVPR 2025
Critic-V: VLM Critics Help Catch VLM Errors in Multimodal Reasoning · CVPR 2025
PatchVSR: Breaking Video Diffusion Resolution Limits with Patch-wise Video Super-Resolution · CVPR 2025
GPAvatar: High-fidelity Head Avatars by Learning Efficient Gaussian Projections · CVPR 2025
Libra-Merging: Importance-redundancy and Pruning-merging Trade-off for Acceleration Plug-in in Large Vision-Language Model · CVPR 2025
Towards Precise Scaling Laws for Video Diffusion Transformers · CVPR 2025
Unleashing the Potential of Multi-modal Foundation Models and Video Diffusion for 4D Dynamic Physical Scene Simulation · CVPR 2025
DSMoE: Matrix-Partitioned Experts with Dynamic Routing for Computation-Efficient Dense LLMs · EMNLP 2025
SPPD: Self-training with Process Preference Learning Using Dynamic Value Margin · EMNLP 2025
Biology-Instructions: A Dataset and Benchmark for Multi-Omics Sequence Understanding Capability of Large Language Models · EMNLP 2025
Decoding at the Speed of Thought: Harnessing Parallel Decoding of Lexical Units for LLMs · COLING 2024
ShieldLM: Empowering LLMs as Aligned, Customizable and Explainable Safety Detectors · EMNLP 2024
Parrot: Enhancing Multi-Turn Instruction Following for Large Language Models · ACL 2024
Learning Multi-Dimensional Human Preference for Text-to-Image Generation · CVPR 2024
Unified Language-Vision Pretraining in LLM with Dynamic Discrete Visual Tokenization · ICLR 2024
Focus On What Matters: Separated Models For Visual-Based RL Generalization · NIPS 2024
Hierarchical multiple instance learning for COPD grading with relatively specific similarity · MICCAI 2024
DialogBench: Evaluating LLMs as Human-like Dialogue Systems · NAACL 2024
Be a Multitude to Itself: A Prompt Evolution Framework for Red Teaming · EMNLP 2024
VideoTetris: Towards Compositional Text-to-Video Generation · NIPS 2024
Improving Large Language Models via Fine-grained Reinforcement Learning with Minimum Editing Constraint · ACL 2024
Video-LaVIT: Unified Video-Language Pre-training with Decoupled Visual-Motional Tokenization · ICML 2024
Just Ask One More Time! Self-Agreement Improves Reasoning of Language Models in (Almost) All Scenarios · ACL 2024
Evaluating Readability and Faithfulness of Concept-based Explanations · EMNLP 2024
Small Agent Can Also Rock! Empowering Small Language Models as Hallucination Detector · EMNLP 2024
Inductive-Deductive Strategy Reuse for Multi-Turn Instructional Dialogues · EMNLP 2024
How to Fine-tune the Model: Unified Model Shift and Model Bias Policy Optimization · NIPS 2023