multimodal learning

4622 papers

Also known as

VLM VLLM MM VLA MLLMS MLM MML MULLM LMM MLLM MMT

Co-occurring keywords

large language model (12755), vision-language model (2235), visual question answering (1000), video understanding (1647), multi-modal learning (1276), contrastive learning (3979), representation learning (6174), transfer learning (5442), zero-shot learning (3637), vision language model (752)

Papers

Prepending or Cross-Attention for Speech-to-Text? An Empirical Comparison NAACL 2025

ViLU: Learning Vision-Language Uncertainties for Failure Prediction ICCV 2025

Everything is a Video: Unifying Modalities through Next-Frame Prediction ICCV 2025

Unified Coding for Both Human Perception and Generalized Machine Analytics with CLIP Supervision AAAI 2025

Rethinking Vision-Language Model in Face Forensics: Multi-Modal Interpretable Forged Face Detector CVPR 2025

A Wander Through the Multimodal Landscape: Efficient Transfer Learning via Low-rank Sequence Multimodal Adapter AAAI 2025

Towards Video Thinking Test: A Holistic Benchmark for Advanced Video Reasoning and Understanding ICCV 2025

SimMLM: A Simple Framework for Multi-modal Learning with Missing Modality ICCV 2025

DAMPER: A Dual-Stage Medical Report Generation Framework with Coarse-Grained MeSH Alignment and Fine-Grained Hypergraph Matching AAAI 2025

AIMA at SemEval-2025 Task 1: Bridging Text and Image for Idiomatic Knowledge Extraction via Mixture of Experts ACL 2025

Understanding Co-speech Gestures in-the-wild ICCV 2025

Driving by the Rules: A Benchmark for Integrating Traffic Sign Regulations into Vectorized HD Map CVPR 2025

Spatial457: A Diagnostic Benchmark for 6D Spatial Reasoning of Large Multimodal Models CVPR 2025

Understanding Figurative Meaning through Explainable Visual Entailment NAACL 2025

DecepBench: Benchmarking Multimodal Deception Detection ACL 2025

DREAM: Improving Video-Text Retrieval Through Relevance-Based Augmentation Using Large Foundation Models NAACL 2025

GestureHYDRA: Semantic Co-speech Gesture Synthesis via Hybrid Modality Diffusion Transformer and Cascaded-Synchronized Retrieval-Augmented Generation ICCV 2025

AutoProteinEngine: A Large Language Model Driven Agent Framework for Multimodal AutoML in Protein Engineering COLING 2025

RCTrans: Radar-Camera Transformer via Radar Densifier and Sequential Decoder for 3D Object Detection AAAI 2025

SVLTA: Benchmarking Vision-Language Temporal Alignment via Synthetic Video Situation CVPR 2025

IDEA: Inverted Text with Cooperative Deformable Aggregation for Multi-modal Object Re-Identification CVPR 2025

Pilot: Building the Federated Multimodal Instruction Tuning Framework AAAI 2025

MetaMorph: Multimodal Understanding and Generation via Instruction Tuning ICCV 2025

SMoLoRA: Exploring and Defying Dual Catastrophic Forgetting in Continual Visual Instruction Tuning ICCV 2025

Adaptive Prompt-Based Semantic Embedding with Inspire Potential of Implicit Knowledge for Cross-Modal Retrieval AAAI 2025