Papers
Benchmarking LLMs on Semantic Overlap Summarization
John Salvador, Naman Bansal, Mousumi Akter et al.
ReflAct: World-Grounded Decision Making in LLM Agents via Goal-State Reflection
Jeonghye Kim, Sojeong Rhee, Minbeom Kim et al.
CompassVerifier: A Unified and Robust Verifier for LLMs Evaluation and Outcome Reward
Shudong Liu, Hongwei Liu, Junnan Liu et al.
A Knowledge-driven Adaptive Collaboration of LLMs for Enhancing Medical Decision-making
Xiao Wu, Ting-Zhu Huang, Liang-Jian Deng et al.
NESTFUL: A Benchmark for Evaluating LLMs on Nested Sequences of API Calls
Kinjal Basu, Ibrahim Abdelaziz, Kiran Kate et al.
DIWALI - Diversity and Inclusivity aWare cuLture specific Items for India: Dataset and Assessment of LLMs for Cultural Text Adaptation in Indian Context
Pramit Sahoo, Maharaj Brahma, Maunendra Sankar Desarkar
seqBench: A Tunable Benchmark to Quantify Sequential Reasoning Limits of LLMs
Mohammad Ramezanali, Mo Vazifeh, Paolo Santi
SATBench: Benchmarking LLMs’ Logical Reasoning via Automated Puzzle Generation from SAT Formulas
Anjiang Wei, Yuheng Wu, Yingjia Wan et al.
Personalized LLM Decoding via Contrasting Personal Preference
Hyungjune Bu, ChanJoo Jung, Minjae Kang et al.
MPCG: Multi-Round Persona-Conditioned Generation for Modeling the Evolution of Misinformation with LLMs
Chong Jun Rong Brian, Yixuan Tang, Anthony Kum Hoe Tung
Multi-LMentry: Can Multilingual LLMs Solve Elementary Tasks Across Languages?
Luca Moroni, Javier Aula-Blasco, Simone Conia et al.
EduAdapt: A Question Answer Benchmark Dataset for Evaluating Grade-Level Adaptability in LLMs
Numaan Naeem, Abdellah El Mekki, Muhammad Abdul-Mageed
NitiBench: Benchmarking LLM Frameworks on Thai Legal Question Answering Capabilities
Pawitsapak Akarajaradwong, Pirat Pothavorn, Chompakorn Chaksangchaichot et al.
Conflicting Needles in a Haystack: How LLMs behave when faced with contradictory information
Murathan Kurfali, Robert Östling
Reward-Weighted Sampling: Enhancing Non-Autoregressive Characteristics in Masked Diffusion LLMs
Daehoon Gwak, Minseo Jung, Junwoo Park et al.
AI Argues Differently: Distinct Argumentative and Linguistic Patterns of LLMs in Persuasive Contexts
Esra Dönmez, Maximilian Maurer, Gabriella Lapesa et al.
The Illusion of Progress: Re-evaluating Hallucination Detection in LLMs
Denis Janiak, Jakub Binkowski, Albert Sawczyn et al.
Breaking Agents: Compromising Autonomous LLM Agents Through Malfunction Amplification
Boyang Zhang, Yicong Tan, Yun Shen et al.
Trojsten Benchmark: Evaluating LLM Problem-Solving in Slovak STEM Competition Problems
Adam Zahradník, Marek Suppa
A Simple Yet Effective Method for Non-Refusing Context Relevant Fine-grained Safety Steering in LLMs
Shaona Ghosh, Amrita Bhattacharjee, Yftah Ziser et al.
so much depends / upon / a whitespace: Why Whitespace Matters for Poets and LLMs
Sriharsh Bhyravajjula, Melanie Walsh, Anna Preus et al.
Certified Mitigation of Worst-Case LLM Copyright Infringement
Jingyu Zhang, Jiacan Yu, Marc Marone et al.
CourtReasoner: Can LLM Agents Reason Like Judges?
Sophia Simeng Han, Yoshiki Takashima, Shannon Zejiang Shen et al.
Retracing the Past: LLMs Emit Training Data When They Get Lost
Myeongseob Ko, Nikhil Reddy Billa, Adam Nguyen et al.
Table-LLM-Specialist: Language Model Specialists for Tables using Iterative Fine-tuning
Junjie Xing, Yeye He, Mengyu Zhou et al.