Research Explorer
Papers
Trends
Conferences
Explore
Authors
Topics
Keywords
Achievements
About
Methodology
Multi-Modal Learning
1457 directly classified papers
Papers per year
2011: 1
2013: 4
2014: 3
2015: 3
2016: 9
2017: 11
2018: 27
2019: 61
2020: 109
2021: 87
2022: 153
2023: 213
2024: 391
2025: 384
2026: 1
Papers
VideoGUI: A Benchmark for GUI Automation from Instructional Videos
NIPS 2024
A Large-Scale Human-Centric Benchmark for Referring Expression Comprehension in the LMM Era
NIPS 2024
Training-Free Open-Ended Object Detection and Segmentation via Attention as Prompts
NIPS 2024
MM-WLAuslan: Multi-View Multi-Modal Word-Level Australian Sign Language Recognition Dataset
NIPS 2024
ViLCo-Bench: VIdeo Language COntinual learning Benchmark
NIPS 2024
OMG-LLaVA: Bridging Image-level, Object-level, Pixel-level Reasoning and Understanding
NIPS 2024
OmniJARVIS: Unified Vision-Language-Action Tokenization Enables Open-World Instruction Following Agents
NIPS 2024
Pre-trained Text-to-Image Diffusion Models Are Versatile Representation Learners for Control
NIPS 2024
MambaTree: Tree Topology is All You Need in State Space Model
NIPS 2024
Is A Picture Worth A Thousand Words? Delving Into Spatial Reasoning for Vision Language Models
NIPS 2024
Accelerating Pre-training of Multimodal LLMs via Chain-of-Sight
NIPS 2024
GTA: A Benchmark for General Tool Agents
NIPS 2024
TrajCLIP: Pedestrian trajectory prediction method using contrastive learning and idempotent networks
NIPS 2024
DevBench: A multimodal developmental benchmark for language learning
NIPS 2024
Animal-Bench: Benchmarking Multimodal Video Models for Animal-centric Video Understanding
NIPS 2024
SongCreator: Lyrics-based Universal Song Generation
NIPS 2024
On the Comparison between Multi-modal and Single-modal Contrastive Learning
NIPS 2024
CultureLLM: Incorporating Cultural Differences into Large Language Models
NIPS 2024
HENASY: Learning to Assemble Scene-Entities for Interpretable Egocentric Video-Language Model
NIPS 2024
WhodunitBench: Evaluating Large Multimodal Agents via Murder Mystery Games
NIPS 2024
Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs
NIPS 2024
Hidden in Plain Sight: Evaluating Abstract Shape Recognition in Vision-Language Models
NIPS 2024
MMBench-Video: A Long-Form Multi-Shot Benchmark for Holistic Video Understanding
NIPS 2024
Mitigating Object Hallucination via Concentric Causal Attention
NIPS 2024
Boosting Weakly Supervised Referring Image Segmentation via Progressive Comprehension
NIPS 2024