Multimodal Learning

1257 directly classified papers

Papers

ActPlan-1K: Benchmarking the Procedural Planning Ability of Visual Language Models in Household Activities EMNLP 2024

Images Speak Louder than Words: Understanding and Mitigating Bias in Vision-Language Model from a Causal Mediation Perspective EMNLP 2024

Investigating and Mitigating Object Hallucinations in Pretrained Vision-Language (CLIP) Models EMNLP 2024

Updating CLIP to Prefer Descriptions Over Captions EMNLP 2024

RECANTFormer: Referring Expression Comprehension with Varying Numbers of Targets EMNLP 2024

PRISM: A New Lens for Improved Color Understanding EMNLP 2024

Text2Model: Text-based Model Induction for Zero-shot Image Classification EMNLP 2024

Enhancing Temporal Modeling of Video LLMs via Time Gating EMNLP 2024

MACAROON: Training Vision-Language Models To Be Your Engaged Partners EMNLP 2024

AutoHallusion: Automatic Generation of Hallucination Benchmarks for Vision-Language Models EMNLP 2024

VDebugger: Harnessing Execution Feedback for Debugging Visual Programs EMNLP 2024

Vanessa: Visual Connotation and Aesthetic Attributes Understanding Network for Multimodal Aspect-based Sentiment Analysis EMNLP 2024

V-DPO: Mitigating Hallucination in Large Vision Language Models via Vision-Guided Direct Preference Optimization EMNLP 2024

M3SciQA: A Multi-Modal Multi-Document Scientific QA Benchmark for Evaluating Foundation Models EMNLP 2024

SaSR-Net: Source-Aware Semantic Representation Network for Enhancing Audio-Visual Question Answering EMNLP 2024

Grounding Partially-Defined Events in Multimodal Data EMNLP 2024

Unraveling the Truth: Do VLMs really Understand Charts? A Deep Dive into Consistency and Robustness EMNLP 2024

PromptFix: You Prompt and We Fix the Photo NeurIPS 2024

GrounDiT: Grounding Diffusion Transformers via Noisy Patch Transplantation NeurIPS 2024

VL-SAT: Visual-Linguistic Semantics Assisted Training for 3D Semantic Scene Graph Prediction in Point Cloud CVPR 2023

Representing Volumetric Videos As Dynamic MLP Maps CVPR 2023

SeaThru-NeRF: Neural Radiance Fields in Scattering Media CVPR 2023

HRDFuse: Monocular 360° Depth Estimation by Collaboratively Learning Holistic-With-Regional Depth Distributions CVPR 2023

MetaCLUE: Towards Comprehensive Visual Metaphors Research CVPR 2023

PlaneDepth: Self-Supervised Depth Estimation via Orthogonal Planes CVPR 2023