
Multi-Modal Learning

3194 directly classified papers

Papers

Towards Injecting Medical Visual Knowledge into Multimodal LLMs at Scale EMNLP 2024

VLP: Vision Language Planning for Autonomous Driving CVPR 2024

Independency Adversarial Learning for Cross-Modal Sound Separation AAAI 2024

Stitching Segments and Sentences towards Generalization in Video-Text Pre-training AAAI 2024

Dual Memory Networks: A Versatile Adaptation Approach for Vision-Language Models CVPR 2024

Consistency and Uncertainty: Identifying Unreliable Responses From Black-Box Vision-Language Models for Selective Visual Question Answering CVPR 2024

Domain Prompt Learning with Quaternion Networks CVPR 2024

TD²-Net: Toward Denoising and Debiasing for Video Scene Graph Generation AAAI 2024

Robust Noisy Correspondence Learning with Equivariant Similarity Consistency CVPR 2024

Text-conditional Attribute Alignment across Latent Spaces for 3D Controllable Face Image Synthesis CVPR 2024

Image Captioning with Multi-Context Synthetic Data AAAI 2024

Exploring Vision Transformers for 3D Human Motion-Language Models with Motion Patches CVPR 2024

Prompt-Driven Referring Image Segmentation with Instance Contrasting CVPR 2024

Unraveling Instance Associations: A Closer Look for Audio-Visual Segmentation CVPR 2024

Seeing the Unseen: Visual Common Sense for Semantic Placement CVPR 2024

Open-Vocabulary Video Relation Extraction AAAI 2024

Segment beyond View: Handling Partially Missing Modality for Audio-Visual Semantic Segmentation AAAI 2024

Towards Robust Event-guided Low-Light Image Enhancement: A Large-Scale Real-World Event-Image Dataset and Novel Approach CVPR 2024

What, When, and Where? Self-Supervised Spatio-Temporal Grounding in Untrimmed Multi-Action Videos from Narrated Instructions CVPR 2024

L4D-Track: Language-to-4D Modeling Towards 6-DoF Tracking and Shape Reconstruction in 3D Point Cloud Stream CVPR 2024

Vision-and-Language Navigation via Causal Learning CVPR 2024

Hyperbolic Learning with Synthetic Captions for Open-World Detection CVPR 2024

Multimodal Prompt Perceiver: Empower Adaptiveness, Generalizability and Fidelity for All-in-One Image Restoration CVPR 2024

MedM2G: Unifying Medical Multi-Modal Generation via Cross-Guided Diffusion with Visual Invariant CVPR 2024

ExACT: Language-guided Conceptual Reasoning and Uncertainty Estimation for Event-based Action Recognition and More CVPR 2024