Multi-Modal Learning

1457 directly classified papers

Papers

Cross-Modal Projection in Multimodal LLMs Doesn’t Really Project Visual Attributes to Textual Space ACL 2024

EmbSpatial-Bench: Benchmarking Spatial Understanding for Embodied Tasks with Large Vision-Language Models ACL 2024

MTP: A Dataset for Multi-Modal Turning Points in Casual Conversations ACL 2024

Generating Harder Cross-document Event Coreference Resolution Datasets using Metaphoric Paraphrasing ACL 2024

Mirasol3B: A Multimodal Autoregressive Model for Time-Aligned and Contextual Modalities CVPR 2024

SceMQA: A Scientific College Entrance Level Multimodal Question Answering Benchmark ACL 2024

MMToM-QA: Multimodal Theory of Mind Question Answering ACL 2024

Talk With Human-like Agents: Empathetic Dialogue Through Perceptible Acoustic Reception and Reaction ACL 2024

OpenWebAgent: An Open Toolkit to Enable Web Agents on Large Language Models ACL 2024

Multi-modal Preference Alignment Remedies Degradation of Visual Instruction Tuning on Language Models ACL 2024

Neural Sign Actors: A Diffusion Model for 3D Sign Language Production from Text CVPR 2024

Evaluating Very Long-Term Conversational Memory of LLM Agents ACL 2024

InfiMM: Advancing Multimodal Understanding with an Open-Sourced Visual Language Model ACL 2024

Learning to Decode Collaboratively with Multiple Language Models ACL 2024

XLAVS-R: Cross-Lingual Audio-Visual Speech Representation Learning for Noise-Robust Speech Perception ACL 2024

Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models ACL 2024

You Only Look at Screens: Multimodal Chain-of-Action Agents ACL 2024

SmartEdit: Exploring Complex Instruction-based Image Editing with Multimodal Large Language Models CVPR 2024

Autonomous Workflow for Multimodal Fine-Grained Training Assistants Towards Mixed Reality ACL 2024

SportsHHI: A Dataset for Human-Human Interaction Detection in Sports Videos CVPR 2024

Tell Me What’s Next: Textual Foresight for Generic UI Representations ACL 2024

MeLFusion: Synthesizing Music from Image and Language Cues using Diffusion Models CVPR 2024

Text-to-3D Generation with Bidirectional Diffusion using both 2D and 3D priors CVPR 2024

Semantics-aware Motion Retargeting with Vision-Language Models CVPR 2024

Language Models as Black-Box Optimizers for Vision-Language Models CVPR 2024