Papers
11,951 papers found
Vector-ICL: In-context Learning with Continuous Vector Representations
Yufan Zhuang, Chandan Singh, Liyuan Liu et al.
VEDIT: Latent Prediction Architecture For Procedural Video Representation Learning
Han Lin, Tushar Nagarajan, Nicolas Ballas et al.
Verifying Properties of Binary Neural Networks Using Sparse Polynomial Optimization
Jianting Yang, Srecko Durasinovic, Jean B. Lasserre et al.
Vertical Federated Learning with Missing Features During Training and Inference
Pedro Valdeira, Shiqiang Wang, Yuejie Chi
Vevo: Controllable Zero-Shot Voice Imitation with Self-Supervised Disentanglement
Xueyao Zhang, Xiaohui Zhang, Kainan Peng et al.
VibeCheck: Discover and Quantify Qualitative Differences in Large Language Models
Lisa Dunlap, Krishna Mandal, Trevor Darrell et al.
ViBiDSampler: Enhancing Video Interpolation Using Bidirectional Diffusion Sampler
Serin Yang, Taesung Kwon, Jong Chul Ye
VICtoR: Learning Hierarchical Vision-Instruction Correlation Rewards for Long-horizon Manipulation
Kuo-Han Hung, Pang-Chi Lo, Jia-Fong Yeh et al.
Video Action Differencing
James Burgess, Xiaohan Wang, Yuhui Zhang et al.
VideoGrain: Modulating Space-Time Attention for Multi-Grained Video Editing
Xiangpeng Yang, Linchao Zhu, Hehe Fan et al.
Video In-context Learning: Autoregressive Transformers are Zero-Shot Video Imitators
Wentao Zhang, Junliang Guo, Tianyu He et al.
VideoPhy: Evaluating Physical Commonsense for Video Generation
Hritik Bansal, Zongyu Lin, Tianyi Xie et al.
VideoShield: Regulating Diffusion-based Video Generation Models via Watermarking
Runyi Hu, Jie Zhang, Yiming Li et al.
Video-STaR: Self-Training Enables Video Instruction Tuning with Any Supervision
Orr Zohar, Xiaohan Wang, Yonatan Bitton et al.
VideoWebArena: Evaluating Long Context Multimodal Agents with Video Understanding Web Tasks
Lawrence Keunho Jang, Yinheng Li, Dan Zhao et al.
ViDiT-Q: Efficient and Accurate Quantization of Diffusion Transformers for Image and Video Generation
Tianchen Zhao, Tongcheng Fang, Haofeng Huang et al.
VILA-U: a Unified Foundation Model Integrating Visual Understanding and Generation
Yecheng Wu, Zhuoyang Zhang, Junyu Chen et al.
ViSAGe: Video-to-Spatial Audio Generation
Jaeyeon Kim, Heeseung Yun, Gunhee Kim
Vision and Language Synergy for Rehearsal Free Continual Learning
Muhammad Anwar Ma'sum, Mahardhika Pratama, Savitha Ramasamy et al.
Vision CNNs trained to estimate spatial latents learned similar ventral-stream-aligned representations
Yudi Xie, Weichen Huang, Esther Alter et al.
Vision Language Models are In-Context Value Learners
Yecheng Jason Ma, Joey Hejna, Chuyuan Fu et al.
Vision-LSTM: xLSTM as Generic Vision Backbone
Benedikt Alkin, Maximilian Beck, Korbinian Pöppel et al.
Vision-RWKV: Efficient and Scalable Visual Perception with RWKV-Like Architectures
Yuchen Duan, Weiyun Wang, Zhe Chen et al.
VisRAG: Vision-based Retrieval-augmented Generation on Multi-modality Documents
Shi Yu, Chaoyue Tang, Bokai Xu et al.
VisualAgentBench: Towards Large Multimodal Models as Visual Foundation Agents
Xiao Liu, Tianjie Zhang, Yu Gu et al.