Papers
What Question Did You Answer? Refining Contact Center Evaluation Plans via Backward Questions
Prajwal Sood, Rushikesh Pawar, Digvijay Anil Ingle et al.
What’s Left Unsaid? Detecting and Correcting Misleading Omissions in Multimodal News Previews
Fanxiao Li, Jiaying Wu, Tingchao Fu et al.
When Agents Look the Same: Quantifying Distillation-Induced Similarity in Tool-Use Behaviors
Chenghao Yang, Yuning Zhang, Zhoufutu Wen et al.
When Background Matters: Breaking Medical Vision Language Models by Transferable Attack
Akash Ghosh, Subhadip Baidya, Sriparna Saha et al.
When Benchmarks Leak: Inference-Time Decontamination for LLMs
Jianzhe Chai, Yu Zhe, Jun Sakuma
When Bigger Isn’t Better: A Comprehensive Fairness Evaluation of Political Bias in Multi-News Summarisation
Nannan Huang, Iffat Maab, Junichi Yamagishi
When Correct Beliefs Collapse: Epistemic Resilience of LLMs under Clinical Pressure
Boyu Xiao, Xiuqi Tian, Xuwen Song et al.
When "Correct" Is Not Safe: Can We Trust Functionally Correct Patches Generated by Code Agents?
Yibo Peng, James Song, Lei Li et al.
When Does Language Matter? Multilingual Instructions Reveal Step-wise Language Sensitivity in Vision-Language-Action Models
Xuan Dong, Zhe Han, Tianhao Niu et al.
When Does Mixing Help? Analyzing Query Embedding Interpolation in Multilingual Dense Retrieval
Tongyao Zhu, Huang Chao Ming, Min-Yen Kan
When Efficiency Becomes a Vulnerability: Computational Cost Attacks on WebAgents
Liang-Bo Ning, Yuchen Zhu, Heqing Huang et al.
When Efficiency Meets Safety: A Benchmark Security Analysis of KV Cache Compression in Large Language Models
Xiaoxiao Ma, Kuofeng Gao, Zeyi Lu et al.
When Good OCR Is Not Enough: Benchmarking OCR Robustness for Retrieval-Augmented Generation
Lin Sun, Wangdexian, Jingang Huang et al.
When High Accuracy Hides Poor Calibration: Rethinking Confidence Evaluation in Transformer-Based Text Classification with Balanced Brier Score
Guilherme Fonseca, Gabriel Prenassi, Washington Cunha et al.
When Identity Skews Debate: Anonymization for Bias-Reduced Multi-Agent Reasoning
Hyeong Kyu Choi, Jerry Zhu, Sharon Li
When in Doubt, Consult: Expert Debate for Sexism Detection via Confidence-Based Routing
Anwar Alajmi, Gabriele Pergola
When Is Thinking Enough? Early Exit via Sufficiency Assessment for Efficient Reasoning
Yang Xiang, Yixin Ji, Ruotao Xu et al.
When KV Cache Reuse Fails in Multi-Agent Systems: Cross-Candidate Interaction is Crucial for LLM Judges
Sichu Liang, Zhenglin Wang, Chujiajia et al.
When LLMs Read Tables Carelessly: Measuring and Reducing Data Referencing Errors
Yuqing Yang, Qi Zhu, Zhen Han et al.
When Misinformation Speaks and Converses: Rethinking Fact-Checking in Audio Platforms
Chaewan Chun, Delvin Ce Zhang, Dongwon Lee
When Models Hesitate: Answer Instability as a Label-Free Uncertainty Signal for LLMs
Jasper Meynard Arana, Kristine Ann M. Carandang, Ethan Robert Casin et al.
When More Words Say Less: Decoupling Length and Specificity in Image Description Evaluation
Rhea Kapur, Robert D. Hawkins, Elisa Kreiss
When Morphology Hides in Plain Sight: Breaking the Isolation in Vietnamese and Beyond
Anh Trac Duc Dinh, Khang Hoang Nhat Vo, Tai Tien Ta et al.
When One LLM Drools, Multi-LLM Collaboration Rules
Shangbin Feng, Wenxuan Ding, Alisa Liu et al.
When Personalization Legitimizes Risks: Uncovering Safety Vulnerabilities in Personalized Dialogue Agents
Jiahe Guo, Xiangran Guo, Yulin Hu et al.