Multimodal Learning

1257 directly classified papers

Papers

L4D-Track: Language-to-4D Modeling Towards 6-DoF Tracking and Shape Reconstruction in 3D Point Cloud Stream CVPR 2024

Zero-shot Referring Expression Comprehension via Structural Similarity Between Images and Captions CVPR 2024

Video-LLaVA: Learning United Visual Representation by Alignment Before Projection EMNLP 2024

Towards Better Vision-Inspired Vision-Language Models CVPR 2024

Probing Synergistic High-Order Interaction in Infrared and Visible Image Fusion CVPR 2024

Quantifying the Gaps Between Translation and Native Perception in Training for Multimodal, Multilingual Retrieval EMNLP 2024

Unraveling Instance Associations: A Closer Look for Audio-Visual Segmentation CVPR 2024

DIEM: Decomposition-Integration Enhancing Multimodal Insights CVPR 2024

Holistic Autonomous Driving Understanding by Bird's-Eye-View Injected Multi-Modal Large Models CVPR 2024

HEAL-SWIN: A Vision Transformer On The Sphere CVPR 2024

Concept-skill Transferability-based Data Selection for Large Vision-Language Models EMNLP 2024

From the Least to the Most: Building a Plug-and-Play Visual Reasoner via Data Synthesis EMNLP 2024

Mitigating Open-Vocabulary Caption Hallucinations EMNLP 2024

Multiple Knowledge-Enhanced Interactive Graph Network for Multimodal Conversational Emotion Recognition EMNLP 2024

SURf: Teaching Large Vision-Language Models to Selectively Utilize Retrieved Information EMNLP 2024

RWKV-CLIP: A Robust Vision-Language Representation Learner EMNLP 2024

MEANT: Multimodal Encoder for Antecedent Information EMNLP 2024

World to Code: Multi-modal Data Generation via Self-Instructed Compositional Captioning and Filtering EMNLP 2024

If CLIP Could Talk: Understanding Vision-Language Model Representations Through Their Preferred Concept Descriptions EMNLP 2024

Can visual language models resolve textual ambiguity with visual cues? Let visual puns tell you! EMNLP 2024

MAR: Matching-Augmented Reasoning for Enhancing Visual-based Entity Question Answering EMNLP 2024

CommVQA: Situating Visual Question Answering in Communicative Contexts EMNLP 2024

Assessing News Thumbnail Representativeness: Counterfactual text can enhance the cross-modal matching ability ACL 2024

ChartAssistant: A Universal Chart Multimodal Language Model via Chart-to-Table Pre-training and Multitask Instruction Tuning ACL 2024

VisDiaHalBench: A Visual Dialogue Benchmark For Diagnosing Hallucination in Large Vision-Language Models ACL 2024