Research Explorer
Multi-Modal Learning
3194 directly classified papers
Papers per year
2003: 1
2010: 1
2011: 1
2013: 5
2014: 3
2015: 9
2016: 23
2017: 49
2018: 78
2019: 158
2020: 223
2021: 261
2022: 354
2023: 471
2024: 705
2025: 835
2026: 17
Papers
Exploring Vision Transformers for 3D Human Motion-Language Models with Motion Patches
CVPR 2024
Prompt-Driven Referring Image Segmentation with Instance Contrasting
CVPR 2024
L4D-Track: Language-to-4D Modeling Towards 6-DoF Tracking and Shape Reconstruction in 3D Point Cloud Stream
CVPR 2024
Omni-SMoLA: Boosting Generalist Multimodal Models with Soft Mixture of Low-rank Experts
CVPR 2024
CoVR: Learning Composed Video Retrieval from Web Video Captions
AAAI 2024
Structural Information Guided Multimodal Pre-training for Vehicle-Centric Perception
AAAI 2024
Unraveling Instance Associations: A Closer Look for Audio-Visual Segmentation
CVPR 2024
SED: A Simple Encoder-Decoder for Open-Vocabulary Semantic Segmentation
CVPR 2024
Open-Vocabulary Video Relation Extraction
AAAI 2024
Revisiting motion information for RGB-Event tracking with MOT philosophy
NeurIPS 2024
PromptAD: Zero-Shot Anomaly Detection Using Text Prompts
WACV 2024
Jack of All Tasks, Master of Many: Designing General-Purpose Coarse-to-Fine Vision-Language Model
CVPR 2024
TOP-ReID: Multi-Spectral Object Re-identification with Token Permutation
AAAI 2024
Sound3DVDet: 3D Sound Source Detection Using Multiview Microphone Array and RGB Images
WACV 2024
Seeing the Unseen: Visual Common Sense for Semantic Placement
CVPR 2024
RegionGPT: Towards Region Understanding Vision Language Model
CVPR 2024
Simple Image-Level Classification Improves Open-Vocabulary Object Detection
AAAI 2024
EgoThink: Evaluating First-Person Perspective Thinking Capability of Vision-Language Models
CVPR 2024
TeMO: Towards Text-Driven 3D Stylization for Multi-Object Meshes
CVPR 2024
Prompt Highlighter: Interactive Control for Multi-Modal LLMs
CVPR 2024
Fewer Steps, Better Performance: Efficient Cross-Modal Clip Trimming for Video Moment Retrieval Using Language
AAAI 2024
DiffSHEG: A Diffusion-Based Approach for Real-Time Speech-driven Holistic 3D Expression and Gesture Generation
CVPR 2024
When Visual Grounding Meets Gigapixel-level Large-scale Scenes: Benchmark and Approach
CVPR 2024
DisCo: Disentangled Control for Realistic Human Dance Generation
CVPR 2024
Towards Efficient and Effective Text-to-Video Retrieval with Coarse-to-Fine Visual Representation Learning
AAAI 2024