Multimodal Learning

13,057 papers

Papers

SIFT-50M: A Large-Scale Multilingual Dataset for Speech Instruction Fine-Tuning ACL 2025

Recent Advances in Speech Language Models: A Survey ACL 2025

MM-Verify: Enhancing Multimodal Reasoning with Chain-of-Thought Verification ACL 2025

Investigating and Enhancing the Robustness of Large Multimodal Models Against Temporal Inconsistency ACL 2025

Emma-X: An Embodied Multimodal Action Model with Grounded Chain of Thought and Look-ahead Spatial Reasoning ACL 2025

Can MLLMs Understand the Deep Implication Behind Chinese Images? ACL 2025

OmniFlatten: An End-to-end GPT Model for Seamless Voice Conversation ACL 2025

EAGLE: Expert-Guided Self-Enhancement for Preference Alignment in Pathology Large Vision-Language Model ACL 2025

RSVP: Reasoning Segmentation via Visual Prompting and Multi-modal Chain-of-Thought ACL 2025

Can Vision-Language Models Evaluate Handwritten Math? ACL 2025

HiddenDetect: Detecting Jailbreak Attacks against Multimodal Large Language Models via Monitoring Hidden States ACL 2025

CART: A Generative Cross-Modal Retrieval Framework With Coarse-To-Fine Semantic Modeling ACL 2025

LLaVA Steering: Visual Instruction Tuning with 500x Fewer Parameters through Modality Linear Representation-Steering ACL 2025

Predicting Turn-Taking and Backchannel in Human-Machine Conversations Using Linguistic, Acoustic, and Visual Signals ACL 2025

iQUEST: An Iterative Question-Guided Framework for Knowledge Base Question Answering ACL 2025

Program Synthesis Benchmark for Visual Programming in XLogoOnline Environment ACL 2025

VLMInferSlow: Evaluating the Efficiency Robustness of Large Vision-Language Models as a Service ACL 2025

CrisisTS: Coupling Social Media Textual Data and Meteorological Time Series for Urgency Classification ACL 2025

Rhythm Controllable and Efficient Zero-Shot Voice Conversion via Shortcut Flow Matching ACL 2025

MMBoundary: Advancing MLLM Knowledge Boundary Awareness through Reasoning Step Confidence Calibration ACL 2025

SimulS2S-LLM: Unlocking Simultaneous Inference of Speech LLMs for Speech-to-Speech Translation ACL 2025

VoxEval: Benchmarking the Knowledge Understanding Capabilities of End-to-End Spoken Language Models ACL 2025

Can’t See the Forest for the Trees: Benchmarking Multimodal Safety Awareness for Multimodal LLMs ACL 2025

Movie101v2: Improved Movie Narration Benchmark ACL 2025

Scaling Text-Rich Image Understanding via Code-Guided Synthetic Multimodal Data Generation ACL 2025