Multi-Modal Learning

1457 directly classified papers

Papers

Image Safeguarding: Reasoning with Conditional Vision Language Model and Obfuscating Unsafe Content Counterfactually AAAI 2024

DanceCamera3D: 3D Camera Movement Synthesis with Music and Dance CVPR 2024

Analyzing Key Factors Influencing Emotion Prediction Performance of VLLMs in Conversational Contexts EMNLP 2024

Modeling Collaborator: Enabling Subjective Vision Classification With Minimal Human Effort via LLM Tool-Use CVPR 2024

Revisiting motion information for RGB-Event tracking with MOT philosophy NeurIPS 2024

GAMA: A Large Audio-Language Model with Advanced Audio Understanding and Complex Reasoning Abilities EMNLP 2024

Read Anywhere Pointed: Layout-aware GUI Screen Reading with Tree-of-Lens Grounding EMNLP 2024

Can visual language models resolve textual ambiguity with visual cues? Let visual puns tell you! EMNLP 2024

VIVA: A Benchmark for Vision-Grounded Decision-Making with Human Values EMNLP 2024

Embodied Multi-Modal Agent trained by an LLM from a Parallel TextWorld CVPR 2024

Decompose and Compare Consistency: Measuring VLMs’ Answer Reliability via Task-Decomposition Consistency Comparison EMNLP 2024

By My Eyes: Grounding Multimodal Large Language Models with Sensor Data via Visual Prompting EMNLP 2024

ConvoFusion: Multi-Modal Conversational Diffusion for Co-Speech Gesture Synthesis CVPR 2024

Towards Low-Resource Harmful Meme Detection with LMM Agents EMNLP 2024

VGBench: Evaluating Large Language Models on Vector Graphics Understanding and Generation EMNLP 2024

Ranking Distillation for Open-Ended Video Question Answering with Insufficient Labels CVPR 2024

TCSinger: Zero-Shot Singing Voice Synthesis with Style Transfer and Multi-Level Style Control EMNLP 2024

LLMs are Good Action Recognizers CVPR 2024

Tag-grounded Visual Instruction Tuning with Retrieval Augmentation EMNLP 2024

RWKV-CLIP: A Robust Vision-Language Representation Learner EMNLP 2024

A Hierarchical Network for Multimodal Document-Level Relation Extraction AAAI 2024

TraveLER: A Modular Multi-LMM Agent Framework for Video Question-Answering EMNLP 2024

Frequency Spectrum Is More Effective for Multimodal Representation and Fusion: A Multimodal Spectrum Rumor Detector AAAI 2024

LaRA: Large Rank Adaptation for Speech and Text Cross-Modal Learning in Large Language Models EMNLP 2024

Deciphering Cognitive Distortions in Patient-Doctor Mental Health Conversations: A Multimodal LLM-Based Detection and Reasoning Framework EMNLP 2024