Research Explorer
Multi-Modal Learning
1457 directly classified papers
Papers per year
2011: 1
2013: 4
2014: 3
2015: 3
2016: 9
2017: 11
2018: 27
2019: 61
2020: 109
2021: 87
2022: 153
2023: 213
2024: 391
2025: 384
2026: 1
Papers
Image Safeguarding: Reasoning with Conditional Vision Language Model and Obfuscating Unsafe Content Counterfactually
AAAI 2024
DanceCamera3D: 3D Camera Movement Synthesis with Music and Dance
CVPR 2024
Analyzing Key Factors Influencing Emotion Prediction Performance of VLLMs in Conversational Contexts
EMNLP 2024
Modeling Collaborator: Enabling Subjective Vision Classification With Minimal Human Effort via LLM Tool-Use
CVPR 2024
Revisiting motion information for RGB-Event tracking with MOT philosophy
NeurIPS 2024
GAMA: A Large Audio-Language Model with Advanced Audio Understanding and Complex Reasoning Abilities
EMNLP 2024
Read Anywhere Pointed: Layout-aware GUI Screen Reading with Tree-of-Lens Grounding
EMNLP 2024
Can visual language models resolve textual ambiguity with visual cues? Let visual puns tell you!
EMNLP 2024
VIVA: A Benchmark for Vision-Grounded Decision-Making with Human Values
EMNLP 2024
Embodied Multi-Modal Agent trained by an LLM from a Parallel TextWorld
CVPR 2024
Decompose and Compare Consistency: Measuring VLMs’ Answer Reliability via Task-Decomposition Consistency Comparison
EMNLP 2024
By My Eyes: Grounding Multimodal Large Language Models with Sensor Data via Visual Prompting
EMNLP 2024
ConvoFusion: Multi-Modal Conversational Diffusion for Co-Speech Gesture Synthesis
CVPR 2024
Towards Low-Resource Harmful Meme Detection with LMM Agents
EMNLP 2024
VGBench: Evaluating Large Language Models on Vector Graphics Understanding and Generation
EMNLP 2024
Ranking Distillation for Open-Ended Video Question Answering with Insufficient Labels
CVPR 2024
TCSinger: Zero-Shot Singing Voice Synthesis with Style Transfer and Multi-Level Style Control
EMNLP 2024
LLMs are Good Action Recognizers
CVPR 2024
Tag-grounded Visual Instruction Tuning with Retrieval Augmentation
EMNLP 2024
RWKV-CLIP: A Robust Vision-Language Representation Learner
EMNLP 2024
A Hierarchical Network for Multimodal Document-Level Relation Extraction
AAAI 2024
TraveLER: A Modular Multi-LMM Agent Framework for Video Question-Answering
EMNLP 2024
Frequency Spectrum Is More Effective for Multimodal Representation and Fusion: A Multimodal Spectrum Rumor Detector
AAAI 2024
LaRA: Large Rank Adaptation for Speech and Text Cross-Modal Learning in Large Language Models
EMNLP 2024
Deciphering Cognitive Distortions in Patient-Doctor Mental Health Conversations: A Multimodal LLM-Based Detection and Reasoning Framework
EMNLP 2024