Research Explorer
Multi-Modal Learning
1457 directly classified papers
Papers per year
2011: 1
2013: 4
2014: 3
2015: 3
2016: 9
2017: 11
2018: 27
2019: 61
2020: 109
2021: 87
2022: 153
2023: 213
2024: 391
2025: 384
2026: 1
Papers
Beyond Embeddings: The Promise of Visual Table in Visual Reasoning
EMNLP 2024
SignCLIP: Connecting Text and Sign Language by Contrastive Learning
EMNLP 2024
RWKV-CLIP: A Robust Vision-Language Representation Learner
EMNLP 2024
VGBench: Evaluating Large Language Models on Vector Graphics Understanding and Generation
EMNLP 2024
DAMRO: Dive into the Attention Mechanism of LVLM to Reduce Object Hallucination
EMNLP 2024
Mining Fine-Grained Image-Text Alignment for Zero-Shot Captioning via Text-Only Training
AAAI 2024
Mitigating Noisy Correspondence by Geometrical Structure Consistency Learning
CVPR 2024
TraveLER: A Modular Multi-LMM Agent Framework for Video Question-Answering
EMNLP 2024
Towards Robust Speech Representation Learning for Thousands of Languages
EMNLP 2024
Towards Efficient and Effective Text-to-Video Retrieval with Coarse-to-Fine Visual Representation Learning
AAAI 2024
CAD-SIGNet: CAD Language Inference from Point Clouds using Layer-wise Sketch Instance Guided Attention
CVPR 2024
CLiC: Concept Learning in Context
CVPR 2024
Self-Powered LLM Modality Expansion for Large Speech-Text Models
EMNLP 2024
Concept-skill Transferability-based Data Selection for Large Vision-Language Models
EMNLP 2024
SOK-Bench: A Situated Video Reasoning Benchmark with Aligned Open-World Knowledge
CVPR 2024
EarthVQA: Towards Queryable Earth via Relational Reasoning-Based Remote Sensing Visual Question Answering
AAAI 2024
Diff-BGM: A Diffusion Model for Video Background Music Generation
CVPR 2024
A Multimodal, Multi-Task Adapting Framework for Video Action Recognition
AAAI 2024
VCoder: Versatile Vision Encoders for Multimodal Large Language Models
CVPR 2024
Event Stream-based Visual Object Tracking: A High-Resolution Benchmark Dataset and A Novel Baseline
CVPR 2024
Iterated Learning Improves Compositionality in Large Vision-Language Models
CVPR 2024
Cycle-Consistency Learning for Captioning and Grounding
AAAI 2024
Text2Loc: 3D Point Cloud Localization from Natural Language
CVPR 2024
Monkey: Image Resolution and Text Label Are Important Things for Large Multi-modal Models
CVPR 2024
Bias-Conflict Sample Synthesis and Adversarial Removal Debias Strategy for Temporal Sentence Grounding in Video
AAAI 2024