Multi-Modal Learning

3194 directly classified papers

Papers

The Multimodal Universe: Enabling Large-Scale Machine Learning with 100 TB of Astronomical Scientific Data NIPS 2024

EZ-HOI: VLM Adaptation via Guided Prompt Learning for Zero-Shot HOI Detection NIPS 2024

Diffusion-based Reinforcement Learning via Q-weighted Variational Policy Optimization NIPS 2024

HEST-1k: A Dataset For Spatial Transcriptomics and Histology Image Analysis NIPS 2024

E2E-MFD: Towards End-to-End Synchronous Multimodal Fusion Detection NIPS 2024

Calibrated Self-Rewarding Vision Language Models NIPS 2024

Cracking the Code of Juxtaposition: Can AI Models Understand the Humorous Contradictions NIPS 2024

PLIP: Language-Image Pre-training for Person Representation Learning NIPS 2024

MoME: Mixture of Multimodal Experts for Generalist Multimodal Large Language Models NIPS 2024

Physics-Regularized Multi-Modal Image Assimilation for Brain Tumor Localization NIPS 2024

Meteor: Mamba-based Traversal of Rationale for Large Language and Vision Models NIPS 2024

An eye for an ear: zero-shot audio description leveraging an image captioner with audio-visual token distribution matching NIPS 2024

MINT-1T: Scaling Open-Source Multimodal Data by 10x: A Multimodal Dataset with One Trillion Tokens NIPS 2024

FiVA: Fine-grained Visual Attribute Dataset for Text-to-Image Diffusion Models NIPS 2024

DiPEx: Dispersing Prompt Expansion for Class-Agnostic Object Detection NIPS 2024

CLIP in Mirror: Disentangling text from visual images through reflection NIPS 2024

DeepStack: Deeply Stacking Visual Tokens is Surprisingly Simple and Effective for LMMs NIPS 2024

DiffCLIP: Leveraging Stable Diffusion for Language Grounded 3D Classification WACV 2024

MolTC: Towards Molecular Relational Modeling In Language Models ACL 2024

Open-World Human-Object Interaction Detection via Multi-modal Prompts CVPR 2024

Unified Generative and Discriminative Training for Multi-modal Large Language Models NIPS 2024

Multimodal Task Vectors Enable Many-Shot Multimodal In-Context Learning NIPS 2024

Visual Anchors Are Strong Information Aggregators For Multimodal Large Language Model NIPS 2024

Locating What You Need: Towards Adapting Diffusion Models to OOD Concepts In-the-Wild NIPS 2024

SafeSora: Towards Safety Alignment of Text2Video Generation via a Human Preference Dataset NIPS 2024