Papers
Video OWL-ViT: Temporally-consistent Open-world Localization in Video
Georg Heigold, Matthias Minderer, Alexey Gritsenko et al.
Video State-Changing Object Segmentation
Jiangwei Yu, Xiang Li, Xinran Zhao et al.
Video Task Decathlon: Unifying Image and Video Tasks in Autonomous Driving
Thomas E. Huang, Yifan Liu, Luc Van Gool et al.
VidStyleODE: Disentangled Video Editing via StyleGAN and NeuralODEs
Moayed Haji Ali, Andrew Bond, Tolga Birdal et al.
View Consistent Purification for Accurate Cross-View Localization
Shan Wang, Yanhao Zhang, Akhil Perincherry et al.
Viewing Graph Solvability in Practice
Federica Arrigoni, Tomas Pajdla, Andrea Fusiello
ViewRefer: Grasp the Multi-view Knowledge for 3D Visual Grounding
Zoey Guo, Yiwen Tang, Ray Zhang et al.
Viewset Diffusion: (0-)Image-Conditioned 3D Generative Models from 2D Data
Stanislaw Szymanowicz, Christian Rupprecht, Andrea Vedaldi
ViLLA: Fine-Grained Vision-Language Representation Learning from Real-World Data
Maya Varma, Jean-Benoit Delbrouck, Sarah Hooper et al.
ViLTA: Enhancing Vision-Language Pre-training through Textual Augmentation
Weihan Wang, Zhen Yang, Bin Xu et al.
ViM: Vision Middleware for Unified Downstream Transferring
Yutong Feng, Biao Gong, Jianwen Jiang et al.
VI-Net: Boosting Category-level 6D Object Pose Estimation via Learning Decoupled Rotations on the Spherical Representations
Jiehong Lin, Zewei Wei, Yabin Zhang et al.
ViperGPT: Visual Inference via Python Execution for Reasoning
Dídac Surís, Sachit Menon, Carl Vondrick
Virtual Try-On with Pose-Garment Keypoints Guided Inpainting
Zhi Li, Pengfei Wei, Xiang Yin et al.
Visible-Infrared Person Re-Identification via Semantic Alignment and Affinity Inference
Xingye Fang, Yang Yang, Ying Fu
Vision Grid Transformer for Document Layout Analysis
Cheng Da, Chuwei Luo, Qi Zheng et al.
Vision HGNN: An Image is More than a Graph of Nodes
Yan Han, Peihao Wang, Souvik Kundu et al.
Vision Relation Transformer for Unbiased Scene Graph Generation
Gopika Sudhakaran, Devendra Singh Dhami, Kristian Kersting et al.
Vision Transformer Adapters for Generalizable Multitask Learning
Deblina Bhattacharjee, Sabine Süsstrunk, Mathieu Salzmann
Visual Explanations via Iterated Integrated Attributions
Oren Barkan, Yehonatan Elisha, Yuval Asher et al.
Visually-Prompted Language Model for Fine-Grained Scene Graph Generation in an Open World
Qifan Yu, Juncheng Li, Yu Wu et al.
Visual Traffic Knowledge Graph Generation from Scene Images
Yunfei Guo, Fei Yin, Xiao-hui Li et al.
VL-Match: Enhancing Vision-Language Pretraining with Token-Level and Instance-Level Matching
Junyu Bi, Daixuan Cheng, Ping Yao et al.
VLN-PETL: Parameter-Efficient Transfer Learning for Vision-and-Language Navigation
Yanyuan Qiao, Zheng Yu, Qi Wu
VL-PET: Vision-and-Language Parameter-Efficient Tuning via Granularity Control
Zi-Yuan Hu, Yanyang Li, Michael R. Lyu et al.