Multimodal Learning

13,057 papers

Papers

Agent-RewardBench: Towards a Unified Benchmark for Reward Modeling across Perception, Planning, and Safety in Real-World Multimodal Agents ACL 2025

Pixel-Level Reasoning Segmentation via Multi-turn Conversations ACL 2025

Masking in Multi-hop QA: An Analysis of How Language Models Perform with Context Permutation ACL 2025

Insight Over Sight: Exploring the Vision-Knowledge Conflicts in Multimodal LLMs ACL 2025

InSerter: Speech Instruction Following with Unsupervised Interleaved Pre-training ACL 2025

Enhancing Spoken Discourse Modeling in Language Models Using Gestural Cues ACL 2025

OmniAlign-V: Towards Enhanced Alignment of MLLMs with Human Preference ACL 2025

LLaMA-Omni 2: LLM-based Real-time Spoken Chatbot with Autoregressive Streaming Speech Synthesis ACL 2025

Crowdsource, Crawl, or Generate? Creating SEA-VL, a Multicultural Vision-Language Dataset for Southeast Asia ACL 2025

Soundwave: Less is More for Speech-Text Alignment in LLMs ACL 2025

MemeQA: Holistic Evaluation for Meme Understanding ACL 2025

MegaPairs: Massive Data Synthesis for Universal Multimodal Retrieval ACL 2025

UniCodec: Unified Audio Codec with Single Domain-Adaptive Codebook ACL 2025

Any Information Is Just Worth One Single Screenshot: Unifying Search With Visualized Information Retrieval ACL 2025

nvAgent: Automated Data Visualization from Natural Language via Collaborative Agent Workflow ACL 2025

Multilingual Text-to-Image Generation Magnifies Gender Stereotypes ACL 2025

Adversarial Alignment with Anchor Dragging Drift (A3D2): Multimodal Domain Adaptation with Partially Shifted Modalities ACL 2025

Integrating Audio, Visual, and Semantic Information for Enhanced Multimodal Speaker Diarization on Multi-party Conversation ACL 2025

We-Math: Does Your Large Multimodal Model Achieve Human-like Mathematical Reasoning? ACL 2025

V-Oracle: Making Progressive Reasoning in Deciphering Oracle Bones for You and Me ACL 2025

Error-driven Data-efficient Large Multimodal Model Tuning ACL 2025

Align-SLM: Textless Spoken Language Models with Reinforcement Learning from AI Feedback ACL 2025

Deep Temporal Reasoning in Video Language Models: A Cross-Linguistic Evaluation of Action Duration and Completion through Perfect Times ACL 2025

AdaptAgent: Adapting Multimodal Web Agents with Few-Shot Learning from Human Demonstrations ACL 2025

VF-Eval: Evaluating Multimodal LLMs for Generating Feedback on AIGC Videos ACL 2025