Papers
VisCRA: A Visual Chain Reasoning Attack for Jailbreaking Multimodal Large Language Models
Bingrui Sima, Linhua Cong, Wenxuan Wang et al.
VisEscape: A Benchmark for Evaluating Exploration-driven Decision-making in Virtual Escape Rooms
Seungwon Lim, Sungwoong Kim, Jihwan Yu et al.
VisFinEval: A Scenario-Driven Chinese Multimodal Benchmark for Holistic Financial Understanding
Zhaowei Liu, Xin Guo, Haotian Xia et al.
Vision-and-Language Navigation with Analogical Textual Descriptions in LLMs
Yue Zhang, Tianyi Ma, Zun Wang et al.
Vision-Free Retrieval: Rethinking Multimodal Search with Textual Scene Descriptions
Ioanna Ntinou, Alexandros Xenos, Yassine Ouali et al.
VisiPruner: Decoding Discontinuous Cross-Modal Dynamics for Efficient Multimodal LLMs
Yingqi Fan, Anhao Zhao, Jinlan Fu et al.
VistaWise: Building Cost-Effective Agent with Cross-Modal Knowledge Graph for Minecraft
Honghao Fu, Junlong Ren, Qi Chai et al.
Visual-Aware Speech Recognition for Noisy Scenarios
Balaji Darur, Karan Singla
Visual Contextual Attack: Jailbreaking MLLMs with Image-Driven Context Injection
Miao Ziqi, Yi Ding, Lijun Li et al.
VisualEDU: A Benchmark for Assessing Coding and Visual Comprehension through Educational Problem-Solving Video Generation
Hao Chen, Tianyu Shi, Pengran Huang et al.
Visual Program Distillation with Template-Based Augmentation
Michal Shlapentokh-Rothman, Yu-Xiong Wang, Derek Hoiem
Visual Self-Refinement for Autoregressive Models
Jiamian Wang, Ziqi Zhou, Chaithanya Kumar Mummadi et al.
VisualWebInstruct: Scaling up Multimodal Instruction Data through Web Search
Yiming Jia, Jiachen Li, Xiang Yue et al.
VIVA+: Human-Centered Situational Decision-Making
Zhe Hu, Yixiao Ren, Guanzhong Liu et al.
VLA-Mark: A cross modal watermark for large vision-language alignment models
Shuliang Liu, Zheng Qi, Jesse Jiaxi Xu et al.
VLASCD: A Visual Language Action Model for Simultaneous Chatting and Decision Making
Zuojin Tang, Bin Hu, Chenyang Zhao et al.
VLM Is a Strong Reranker: Advancing Multimodal Retrieval-augmented Generation via Knowledge-enhanced Reranking and Noise-injected Training
Zhanpeng Chen, Chengjin Xu, Yiyan Qi et al.
VLP: Vision-Language Preference Learning for Embodied Manipulation
Runze Liu, Chenjia Bai, Jiafei Lyu et al.
VocalNet: Speech LLMs with Multi-Token Prediction for Faster and High-Quality Generation
Yuhao Wang, Heyang Liu, Ziyang Cheng et al.
VoiceBBQ: Investigating Effect of Content and Acoustics in Social Bias of Spoken Language Model
Junhyuk Choi, Ro-hoon Oh, Jihwan Seol et al.
VoiceCraft-X: Unifying Multilingual, Voice-Cloning Speech Synthesis and Speech Editing
Zhisheng Zheng, Puyuan Peng, Anuj Diwan et al.
Voice of a Continent: Mapping Africa’s Speech Technology Frontier
AbdelRahim A. Elmadany, Sang Yun Kwon, Hawau Olamide Toyin et al.
VQA-Augmented Machine Translation with Cross-Modal Contrastive Learning
Zhihui Zhang, Shiliang Sun, Jing Zhao et al.
VRoPE: Rotary Position Embedding for Video Large Language Models
Zikang Liu, Longteng Guo, Yepeng Tang et al.
V-SEAM: Visual Semantic Editing and Attention Modulating for Causal Interpretability of Vision-Language Models
Qidong Wang, Junjie Hu, Ming Jiang