Multimodal Learning

1257 directly classified papers

Papers

Assessing News Thumbnail Representativeness: Counterfactual text can enhance the cross-modal matching ability ACL 2024

SURf: Teaching Large Vision-Language Models to Selectively Utilize Retrieved Information EMNLP 2024

MEANT: Multimodal Encoder for Antecedent Information EMNLP 2024

If CLIP Could Talk: Understanding Vision-Language Model Representations Through Their Preferred Concept Descriptions EMNLP 2024

CommVQA: Situating Visual Question Answering in Communicative Contexts EMNLP 2024

VideoCLIP-XL: Advancing Long Description Understanding for Video CLIP Models EMNLP 2024

Multi-Level Information Retrieval Augmented Generation for Knowledge-based Visual Question Answering EMNLP 2024

Nearest Neighbor Normalization Improves Multimodal Retrieval EMNLP 2024

Mitigating Open-Vocabulary Caption Hallucinations EMNLP 2024

Multiple Knowledge-Enhanced Interactive Graph Network for Multimodal Conversational Emotion Recognition EMNLP 2024

MVP-Bench: Can Large Vision-Language Models Conduct Multi-level Visual Perception Like Humans? EMNLP 2024

Semantic Token Reweighting for Interpretable and Controllable Text Embeddings in CLIP EMNLP 2024

Individuation in Neural Models with and without Visual Grounding EMNLP 2024

Retrieval Evaluation for Long-Form and Knowledge-Intensive Image–Text Article Composition EMNLP 2024

Benchmarking Visually-Situated Translation of Text in Natural Images EMNLP 2024

Sculpting Holistic 3D Representation in Contrastive Language-Image-3D Pre-training CVPR 2024

Domain Prompt Learning with Quaternion Networks CVPR 2024

Make Pixels Dance: High-Dynamic Video Generation CVPR 2024

Summarize the Past to Predict the Future: Natural Language Descriptions of Context Boost Multimodal Object Interaction Anticipation CVPR 2024

Aligning and Prompting Everything All at Once for Universal Visual Perception CVPR 2024

Matching Anything by Segmenting Anything CVPR 2024

ConCon-Chi: Concept-Context Chimera Benchmark for Personalized Vision-Language Tasks CVPR 2024

RELI11D: A Comprehensive Multimodal Human Motion Dataset and Method CVPR 2024

MoDE: CLIP Data Experts via Clustering CVPR 2024

Relightful Harmonization: Lighting-aware Portrait Background Replacement CVPR 2024