Research Explorer
Multi-Modal Learning
1457 directly classified papers
Papers per year
2011: 1
2013: 4
2014: 3
2015: 3
2016: 9
2017: 11
2018: 27
2019: 61
2020: 109
2021: 87
2022: 153
2023: 213
2024: 391
2025: 384
2026: 1
Papers
Cross-Modal Projection in Multimodal LLMs Doesn’t Really Project Visual Attributes to Textual Space
ACL 2024
EmbSpatial-Bench: Benchmarking Spatial Understanding for Embodied Tasks with Large Vision-Language Models
ACL 2024
MTP: A Dataset for Multi-Modal Turning Points in Casual Conversations
ACL 2024
Generating Harder Cross-document Event Coreference Resolution Datasets using Metaphoric Paraphrasing
ACL 2024
Mirasol3B: A Multimodal Autoregressive Model for Time-Aligned and Contextual Modalities
CVPR 2024
SceMQA: A Scientific College Entrance Level Multimodal Question Answering Benchmark
ACL 2024
MMToM-QA: Multimodal Theory of Mind Question Answering
ACL 2024
Talk With Human-like Agents: Empathetic Dialogue Through Perceptible Acoustic Reception and Reaction
ACL 2024
OpenWebAgent: An Open Toolkit to Enable Web Agents on Large Language Models
ACL 2024
Multi-modal Preference Alignment Remedies Degradation of Visual Instruction Tuning on Language Models
ACL 2024
Neural Sign Actors: A Diffusion Model for 3D Sign Language Production from Text
CVPR 2024
Evaluating Very Long-Term Conversational Memory of LLM Agents
ACL 2024
InfiMM: Advancing Multimodal Understanding with an Open-Sourced Visual Language Model
ACL 2024
Learning to Decode Collaboratively with Multiple Language Models
ACL 2024
XLAVS-R: Cross-Lingual Audio-Visual Speech Representation Learning for Noise-Robust Speech Perception
ACL 2024
Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models
ACL 2024
You Only Look at Screens: Multimodal Chain-of-Action Agents
ACL 2024
SmartEdit: Exploring Complex Instruction-based Image Editing with Multimodal Large Language Models
CVPR 2024
Autonomous Workflow for Multimodal Fine-Grained Training Assistants Towards Mixed Reality
ACL 2024
SportsHHI: A Dataset for Human-Human Interaction Detection in Sports Videos
CVPR 2024
Tell Me What’s Next: Textual Foresight for Generic UI Representations
ACL 2024
MeLFusion: Synthesizing Music from Image and Language Cues using Diffusion Models
CVPR 2024
Text-to-3D Generation with Bidirectional Diffusion using both 2D and 3D priors
CVPR 2024
Semantics-aware Motion Retargeting with Vision-Language Models
CVPR 2024
Language Models as Black-Box Optimizers for Vision-Language Models
CVPR 2024