Papers
WebNovelBench: Placing LLM Novelists on the Web Novel Distribution
Liangtao Lin, Jun Zheng, Haidong Wang
WebRollback: Enhancing Web Agents with Explicit Rollback Mechanisms
Zhisong Zhang, Tianqing Fang, Kaixin Ma et al.
What Breaks Knowledge Graph based RAG? Benchmarking and Empirical Insights into Reasoning under Incomplete Knowledge
Dongzhuoran Zhou, Yuqicheng Zhu, Xiaxia Wang et al.
What Does Infect Mean to Cardio? Investigating the Role of Clinical Specialty Data in Medical LLMs
Xinlan Yan, Di Wu, Yibin Lei et al.
What does Surprisal have to do with Information Status?
Andrew Thomas Dyer
What Makes a Good Query? Measuring the Impact of Human-Confusing Linguistic Features on LLM Performance
William Watson, Nicole Cho, Sumitra Ganesh et al.
What Matters to an LLM? Behavioral and Computational Evidences from Summarization
Yongxin Zhou, Changshun Wu, Philippe Mulhem et al.
What Really Matters for Table LLMs? A Meta-Evaluation of Model and Data Effects
Naihao Deng, Sheng Zhang, Henghui Zhu et al.
What’s Missing in Vision-Language Models? Probing Their Struggles with Causal Order Reasoning
Zhaotian Weng, Haoxuan Li, Xin Eric Wang et al.
What the Router Sees Matters: Funnel Pooling for Fast, Content Driven Expert Routing
Josef Pichlmeier, Sebastian Nicolas Mueller, Jakob Sturm et al.
When Benchmarks Age: Temporal Misalignment through Large Language Model Factuality Evaluation
Xunyi Jiang, Dingyi Chang, Julian McAuley et al.
When Can We Trust LLMs in Mental Health? Large-Scale Benchmarks for Reliable LLM Evaluation
Abeer Badawi, Elahe Rahimi, Md Tahmid Rahman Laskar et al.
When Does Auxiliary Modality Matter in Solving Geometric Problems? A Comprehensive Study of Textual, Formal, and Visual Modalities
Hyuk Namgoong, Jeesu Jung, Yerim Han et al.
When Do Language Models Endorse Limitations on Human Rights Principles?
Keenan Samway, Miu Nicole Takagi, Rada Mihalcea et al.
When Flores Bloomz Wrong: Cross-Direction Contamination in Machine Translation Evaluation
David Tan, Pinzhen Chen, Josef Van Genabith et al.
When LLMs Annotate: Reliability Challenges in Low-Resource NLI
Solmaz Panahi, John Kelleher, Vasudevan Nedumpozhimana
When Meanings Meet: Investigating the Emergence and Quality of Shared Concept Spaces during Multilingual Language Model Training
Felicia Körner, Max Müller-Eberstein, Anna Korhonen et al.
When Multilingual Evaluation Assumptions Fail: Tokenization Effects Across Scripts
Manodyna K H, Luc De Nardi
When Prompt Optimization Becomes Jailbreaking: Adaptive Red-Teaming of Large Language Models
Zafir Shamsi, Nikhil Chekuru, Zachary Guzman et al.
When Semantic Overlap Is Not Enough: Cross-Lingual Euphemism Transfer Between Turkish and English
Hasan Can Biyik, Libby Barak, Jing Peng et al.
When Speed Meets Intelligence: Scalable Conversational NER in an Ever-evolving World
Karim Ghonim, Antonio Roberto, Davide Bernardi
When the Model Said ‘No Comment’, We Knew Helpfulness Was Dead, Honesty Was Alive, and Safety Was Terrified
Gautam Siddharth Kashyap, Mark Dras, Usman Naseem
When Words Wear Masks: Detecting Malicious Intents and Hostile Impacts of Online Hate Speech
Priyansh Singhal, Piyush Joshi
Where Are We at with Automatic Speech Recognition for the Bambara Language?
Seydou Diallo, Yacouba Diarra, Panga Azazia Kamaté et al.