Multi-Modal Learning

1457 directly classified papers

Papers

Adaptive Image Quality Assessment via Teaching Large Multimodal Model to Compare NIPS 2024

Dense Connector for MLLMs NIPS 2024

Ada-MSHyper: Adaptive Multi-Scale Hypergraph Transformer for Time Series Forecasting NIPS 2024

Learning Spatially-Aware Language and Audio Embeddings NIPS 2024

An eye for an ear: zero-shot audio description leveraging an image captioner with audio-visual token distribution matching NIPS 2024

Meteor: Mamba-based Traversal of Rationale for Large Language and Vision Models NIPS 2024

Multi-Object Hallucination in Vision Language Models NIPS 2024

PLIP: Language-Image Pre-training for Person Representation Learning NIPS 2024

MARVEL: Multidimensional Abstraction and Reasoning through Visual Evaluation and Learning NIPS 2024

Grasp as You Say: Language-guided Dexterous Grasp Generation NIPS 2024

Cracking the Code of Juxtaposition: Can AI Models Understand the Humorous Contradictions NIPS 2024

WildVision: Evaluating Vision-Language Models in the Wild with Human Preferences NIPS 2024

MultiTrust: A Comprehensive Benchmark Towards Trustworthy Multimodal Large Language Models NIPS 2024

MMScan: A Multi-Modal 3D Scene Dataset with Hierarchical Grounded Language Annotations NIPS 2024

Advancing Cross-domain Discriminability in Continual Learning of Vision-Language Models NIPS 2024

Why are Visually-Grounded Language Models Bad at Image Classification? NIPS 2024

OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments NIPS 2024

G3: An Effective and Adaptive Framework for Worldwide Geolocalization Using Large Multi-Modality Models NIPS 2024

ChatCam: Empowering Camera Control through Conversational AI NIPS 2024

Towards Robust Multimodal Sentiment Analysis with Incomplete Data NIPS 2024

Vript: A Video Is Worth Thousands of Words NIPS 2024

Aligning Audio-Visual Joint Representations with an Agentic Workflow NIPS 2024

Facilitating Multimodal Classification via Dynamically Learning Modality Gap NIPS 2024

Boosting Vision-Language Models with Transduction NIPS 2024

Terra: A Multimodal Spatio-Temporal Dataset Spanning the Earth NIPS 2024