Papers
VideoEspresso: A Large-Scale Chain-of-Thought Dataset for Fine-Grained Video Reasoning via Core Frame Selection
Songhao Han, Wei Huang, Hairong Shi et al.
VideoGEM: Training-free Action Grounding in Videos
Felix Vogel, Walid Bousselham, Anna Kukleva et al.
VideoGigaGAN: Towards Detail-rich Video Super-Resolution
Yiran Xu, Taesung Park, Richard Zhang et al.
VideoGLaMM : A Large Multimodal Model for Pixel-Level Visual Grounding in Videos
Shehan Munasinghe, Hanan Gani, Wenqi Zhu et al.
Video-Guided Foley Sound Generation with Multimodal Controls
Ziyang Chen, Prem Seetharaman, Bryan Russell et al.
VideoGuide: Improving Video Diffusion Models without Training Through a Teacher's Guide
Dohun Lee, Bryan Sangwoo Kim, Geon Yeong Park et al.
VideoHandles: Editing 3D Object Compositions in Videos Using Video Generative Priors
Juil Koo, Paul Guerrero, Chun-Hao P. Huang et al.
VideoICL: Confidence-based Iterative In-context Learning for Out-of-Distribution Video Understanding
Kangsan Kim, Geon Park, Youngwan Lee et al.
Video Language Model Pretraining with Spatio-temporal Masking
Yue Wu, Zhaobo Qi, Junshu Sun et al.
VideoMage: Multi-Subject and Motion Customization of Text-to-Video Diffusion Models
Chi-Pin Huang, Yen-Siang Wu, Hung-Kai Chung et al.
Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis
Chaoyou Fu, Yuhan Dai, Yongdong Luo et al.
Video Motion Transfer with Diffusion Transformers
Alexander Pondaven, Aliaksandr Siarohin, Sergey Tulyakov et al.
Video-Panda: Parameter-efficient Alignment for Encoder-free Video-Language Models
Jinhui Yi, Syed Talal Wasim, Yanan Luo et al.
VideoRefer Suite: Advancing Spatial-Temporal Object Understanding with Video LLM
Yuqian Yuan, Hang Zhang, Wentong Li et al.
VideoScene: Distilling Video Diffusion Model to Generate 3D Scenes in One Step
Hanyang Wang, Fangfu Liu, Jiawei Chi et al.
VideoSPatS: Video SPatiotemporal Splines for Disentangled Occlusion, Appearance and Motion Modeling and Editing
Juan Luis Gonzalez, Xu Yao, Alex Whelan et al.
Video Summarization with Large Language Models
Min Jung Lee, Dayoung Gong, Minsu Cho
VideoTree: Adaptive Tree-based Video Representation for LLM Reasoning on Long Videos
Ziyang Wang, Shoubin Yu, Elias Stengel-Eskin et al.
VideoWorld: Exploring Knowledge Learning from Unlabeled Videos
Zhongwei Ren, Yunchao Wei, Xun Guo et al.
Video-XL: Extra-Long Vision Language Model for Hour-Scale Video Understanding
Yan Shu, Zheng Liu, Peitian Zhang et al.
VidHalluc: Evaluating Temporal Hallucinations in Multimodal Large Language Models for Video Understanding
Chaoyu Li, Eun Woo Im, Pooyan Fazli
VidMuse: A Simple Video-to-Music Generation Framework with Long-Short-Term Modeling
Zeyue Tian, Zhaoyang Liu, Ruibin Yuan et al.
VidSeg: Training-free Video Semantic Segmentation based on Diffusion Models
Qian Wang, Abdelrahman Eldesokey, Mohit Mendiratta et al.
VidTwin: Video VAE with Decoupled Structure and Dynamics
Yuchi Wang, Junliang Guo, Xinyi Xie et al.
Viewpoint Rosetta Stone: Unlocking Unpaired Ego-Exo Videos for View-invariant Representation Learning
Mi Luo, Zihui Xue, Alex Dimakis et al.