Multi-Modal Learning

3194 directly classified papers

Papers

From Sights to Insights: Towards Summarization of Multimodal Clinical Documents ACL 2024

When Visual Grounding Meets Gigapixel-level Large-scale Scenes: Benchmark and Approach CVPR 2024

Dual Memory Networks: A Versatile Adaptation Approach for Vision-Language Models CVPR 2024

3D Feature Tracking via Event Camera CVPR 2024

RGB-X Object Detection via Scene-Specific Fusion Modules WACV 2024

Split to Merge: Unifying Separated Modalities for Unsupervised Domain Adaptation CVPR 2024

SDMTR: A Brain-inspired Transformer for Relation Inference AISTATS 2024

Complex Organ Mask Guided Radiology Report Generation WACV 2024

DMR: Decomposed Multi-Modality Representations for Frames and Events Fusion in Visual Reinforcement Learning CVPR 2024

SimDistill: Simulated Multi-Modal Distillation for BEV 3D Object Detection AAAI 2024

Chain of Generation: Multi-Modal Gesture Synthesis via Cascaded Conditional Control AAAI 2024

Structural Information Guided Multimodal Pre-training for Vehicle-Centric Perception AAAI 2024

TOP-ReID: Multi-Spectral Object Re-identification with Token Permutation AAAI 2024

Beyond Fusion: Modality Hallucination-Based Multispectral Fusion for Pedestrian Detection WACV 2024

Heterogeneous Test-Time Training for Multi-Modal Person Re-identification AAAI 2024

Towards Efficient and Effective Text-to-Video Retrieval with Coarse-to-Fine Visual Representation Learning AAAI 2024

Open-Vocabulary Video Relation Extraction AAAI 2024

Annotation-Free Audio-Visual Segmentation WACV 2024

CoVR: Learning Composed Video Retrieval from Web Video Captions AAAI 2024

GMMFormer: Gaussian-Mixture-Model Based Transformer for Efficient Partially Relevant Video Retrieval AAAI 2024

The MERSA Dataset and a Transformer-Based Approach for Speech Emotion Recognition ACL 2024

InfiMM: Advancing Multimodal Understanding with an Open-Sourced Visual Language Model ACL 2024

MedM2G: Unifying Medical Multi-Modal Generation via Cross-Guided Diffusion with Visual Invariant CVPR 2024

MSU-4S - The Michigan State University Four Seasons Dataset CVPR 2024

EgoThink: Evaluating First-Person Perspective Thinking Capability of Vision-Language Models CVPR 2024