Multi-Modal Learning

1457 directly classified papers

Papers

MMToM-QA: Multimodal Theory of Mind Question Answering ACL 2024

Talk With Human-like Agents: Empathetic Dialogue Through Perceptible Acoustic Reception and Reaction ACL 2024

Multi-modal Preference Alignment Remedies Degradation of Visual Instruction Tuning on Language Models ACL 2024

Assessing News Thumbnail Representativeness: Counterfactual text can enhance the cross-modal matching ability ACL 2024

Generating Human Motion in 3D Scenes from Text Descriptions CVPR 2024

Evaluating Very Long-Term Conversational Memory of LLM Agents ACL 2024

Learning to Decode Collaboratively with Multiple Language Models ACL 2024

Visual Hallucinations of Multi-modal Large Language Models ACL 2024

XLAVS-R: Cross-Lingual Audio-Visual Speech Representation Learning for Noise-Robust Speech Perception ACL 2024

ViCor: Bridging Visual Understanding and Commonsense Reasoning with Large Language Models ACL 2024

Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models ACL 2024

Visualizing Dialogues: Enhancing Image Selection through Dialogue Understanding with Large Language Models ACL 2024

JeDi: Joint-Image Diffusion Models for Finetuning-Free Personalized Text-to-Image Generation CVPR 2024

On the Robustness of Large Multimodal Models Against Image Adversarial Attacks CVPR 2024

Can't Make an Omelette Without Breaking Some Eggs: Plausible Action Anticipation Using Large Video-Language Models CVPR 2024

Diving Deep into the Motion Representation of Video-Text Models ACL 2024

Mask Grounding for Referring Image Segmentation CVPR 2024

Beyond Text: Unveiling Multimodal Proficiency of Large Language Models with MultiAPI Benchmark ACL 2024

QDFormer: Towards Robust Audiovisual Segmentation in Complex Environments with Quantization-based Semantic Decomposition CVPR 2024

NTO3D: Neural Target Object 3D Reconstruction with Segment Anything CVPR 2024

InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks CVPR 2024

A Vision Check-up for Language Models CVPR 2024

Modeling Multimodal Social Interactions: New Challenges and Baselines with Densely Aligned Representations CVPR 2024

DRESS: Instructing Large Vision-Language Models to Align and Interact with Humans via Natural Language Feedback CVPR 2024

BOTH2Hands: Inferring 3D Hands from Both Text Prompts and Body Dynamics CVPR 2024