Multi-Modal Learning

3194 directly classified papers

Papers

Physics-Regularized Multi-Modal Image Assimilation for Brain Tumor Localization NIPS 2024

Meteor: Mamba-based Traversal of Rationale for Large Language and Vision Models NIPS 2024

An eye for an ear: zero-shot audio description leveraging an image captioner with audio-visual token distribution matching NIPS 2024

MINT-1T: Scaling Open-Source Multimodal Data by 10x: A Multimodal Dataset with One Trillion Tokens NIPS 2024

FiVA: Fine-grained Visual Attribute Dataset for Text-to-Image Diffusion Models NIPS 2024

DiPEx: Dispersing Prompt Expansion for Class-Agnostic Object Detection NIPS 2024

CLIP in Mirror: Disentangling text from visual images through reflection NIPS 2024

DeepStack: Deeply Stacking Visual Tokens is Surprisingly Simple and Effective for LMMs NIPS 2024

DiffCLIP: Leveraging Stable Diffusion for Language Grounded 3D Classification WACV 2024

OneLLM: One Framework to Align All Modalities with Language CVPR 2024

Unified Generative and Discriminative Training for Multi-modal Large Language Models NIPS 2024

Multimodal Task Vectors Enable Many-Shot Multimodal In-Context Learning NIPS 2024

Visual Anchors Are Strong Information Aggregators For Multimodal Large Language Model NIPS 2024

Locating What You Need: Towards Adapting Diffusion Models to OOD Concepts In-the-Wild NIPS 2024

SafeSora: Towards Safety Alignment of Text2Video Generation via a Human Preference Dataset NIPS 2024

G2D: From Global to Dense Radiography Representation Learning via Vision-Language Pre-training NIPS 2024

Eye-gaze Guided Multi-modal Alignment for Medical Representation Learning NIPS 2024

Kaleido Diffusion: Improving Conditional Diffusion Models with Autoregressive Latent Modeling NIPS 2024

Unified Insights: Harnessing Multi-modal Data for Phenotype Imputation via View Decoupling NIPS 2024

Unified Lexical Representation for Interpretable Visual-Language Alignment NIPS 2024

FIRE: Food Image to REcipe Generation WACV 2024

Egocentric Action Recognition by Capturing Hand-Object Contact and Object State WACV 2024

MA-LMM: Memory-Augmented Large Multimodal Model for Long-Term Video Understanding CVPR 2024

Speechworthy Instruction-tuned Language Models EMNLP 2024

Posture-Informed Muscular Force Learning for Robust Hand Pressure Estimation NIPS 2024