Multi-Modal Learning

1457 directly classified papers

Papers

U-VAP: User-specified Visual Appearance Personalization via Decoupled Self Augmentation CVPR 2024

Collecting High-quality Multi-modal Conversational Search Data for E-Commerce ACL 2024

GLaMM: Pixel Grounding Large Multimodal Model CVPR 2024

ES3: Evolving Self-Supervised Learning of Robust Audio-Visual Speech Representations CVPR 2024

Real-Time Simulated Avatar from Head-Mounted Sensors CVPR 2024

Discovering Syntactic Interaction Clues for Human-Object Interaction Detection CVPR 2024

Beyond Text: Unveiling Multimodal Proficiency of Large Language Models with MultiAPI Benchmark ACL 2024

CMMU: A Benchmark for Chinese Multi-modal Multi-type Question Understanding and Reasoning IJCAI 2024

Inter-X: Towards Versatile Human-Human Interaction Analysis CVPR 2024

DiaLoc: An Iterative Approach to Embodied Dialog Localization CVPR 2024

Semantics-aware Motion Retargeting with Vision-Language Models CVPR 2024

Visualizing Dialogues: Enhancing Image Selection through Dialogue Understanding with Large Language Models ACL 2024

Multi-modal Stance Detection: New Datasets and Model ACL 2024

What If the TV Was Off? Examining Counterfactual Reasoning Abilities of Multi-modal Language Models CVPR 2024

Talk With Human-like Agents: Empathetic Dialogue Through Perceptible Acoustic Reception and Reaction ACL 2024

Multi-modal Preference Alignment Remedies Degradation of Visual Instruction Tuning on Language Models ACL 2024

Evaluating Very Long-Term Conversational Memory of LLM Agents ACL 2024

Learning to Decode Collaboratively with Multiple Language Models ACL 2024

XLAVS-R: Cross-Lingual Audio-Visual Speech Representation Learning for Noise-Robust Speech Perception ACL 2024

Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models ACL 2024

CLASP: Cross-modal Alignment Using Pre-trained Unimodal Models ACL 2024

MM-LLMs: Recent Advances in MultiModal Large Language Models ACL 2024

UniHuman: A Unified Model For Editing Human Images in the Wild CVPR 2024

DiffAgent: Fast and Accurate Text-to-Image API Selection with Large Language Model CVPR 2024

WonderJourney: Going from Anywhere to Everywhere CVPR 2024