Papers
Vista: A Generalizable Driving World Model with High Fidelity and Versatile Controllability
Shenyuan Gao, Jiazhi Yang, Li Chen et al.
Visual Anchors Are Strong Information Aggregators For Multimodal Large Language Model
Haogeng Liu, Quanzeng You, Xiaotian Han et al.
Visual Autoregressive Modeling: Scalable Image Generation via Next-Scale Prediction
Keyu Tian, Yi Jiang, Zehuan Yuan et al.
Visual CoT: Advancing Multi-Modal Language Models with a Comprehensive Dataset and Benchmark for Chain-of-Thought Reasoning
Hao Shao, Shengju Qian, Han Xiao et al.
Visual Data Diagnosis and Debiasing with Concept Graphs
Rwiddhi Chakraborty, Yinong (Oliver) Wang, Jialu Gao et al.
Visual Decoding and Reconstruction via EEG Embeddings with Guided Diffusion
Dongyang Li, Chen Wei, Shiying Li et al.
Visual Fourier Prompt Tuning
Runjia Zeng, Cheng Han, Qifan Wang et al.
Visual Perception by Large Language Model’s Weights
Feipeng Ma, Hongwei Xue, Yizhou Zhou et al.
Visual Pinwheel Centers Act as Geometric Saliency Detectors
Haixin Zhong, Mingyi Huang, Wei P. Dai et al.
Visual Prompt Tuning in Null Space for Continual Learning
Yue Lu, Shizhou Zhang, De Cheng et al.
Visual Riddles: a Commonsense and World Knowledge Challenge for Large Vision and Language Models
Nitzan Bitton-Guetta, Aviv Slobodkin, Aviya Maimon et al.
Visual Sketchpad: Sketching as a Visual Chain of Thought for Multimodal Language Models
Yushi Hu, Weijia Shi, Xingyu Fu et al.
Vitron: A Unified Pixel-level Vision LLM for Understanding, Generating, Segmenting, Editing
Hao Fei, Shengqiong Wu, Hanwang Zhang et al.
Vivid-ZOO: Multi-View Video Generation with Diffusion Model
Bing Li, Cheng Zheng, Wenxuan Zhu et al.
VLG-CBM: Training Concept Bottleneck Models with Vision-Language Guidance
Divyansh Srivastava, Ge Yan, Tsui-Wei Weng
VLKEB: A Large Vision-Language Model Knowledge Editing Benchmark
Han Huang, Haitian Zhong, Tao Yu et al.
VLM4Bio: A Benchmark Dataset to Evaluate Pretrained Vision-Language Models for Trait Discovery from Biological Images
M. Maruf, Arka Daw, Kazi Sajeed Mehrab et al.
VLM Agents Generate Their Own Memories: Distilling Experience into Embodied Programs of Thought
Gabriel Sarch, Lawrence Jang, Michael J. Tarr et al.
VLMimic: Vision Language Models are Visual Imitation Learner for Fine-grained Actions
Guangyan Chen, Meiling Wang, Te Cui et al.
VMamba: Visual State Space Model
Yue Liu, Yunjie Tian, Yuzhong Zhao et al.
Vocal Call Locator Benchmark (VCL) for localizing rodent vocalizations from multi-channel audio
Ralph E Peterson, Aramis Tanelus, Christopher Ick et al.
Voila-A: Aligning Vision-Language Models with User's Gaze Attention
Kun Yan, Zeyu Wang, Lei Ji et al.
Voxel Mamba: Group-Free State Space Models for Point Cloud based 3D Object Detection
Guowen Zhang, Lue Fan, Chenhang He et al.
Voxel Proposal Network via Multi-Frame Knowledge Distillation for Semantic Scene Completion
Lubo Wang, Di Lin, Kairui Yang et al.
V-PETL Bench: A Unified Visual Parameter-Efficient Transfer Learning Benchmark
Yi Xin, Siqi Luo, Xuyang Liu et al.