Papers
Detecting LLM Hallucination Through Layer-wise Information Deficiency: Analysis of Ambiguous Prompts and Unanswerable Questions
Hazel Kim, Tom A. Lamb, Adel Bibi et al.
MedFact: A Large-scale Chinese Dataset for Evidence-based Medical Fact-checking of LLM Responses
Tong Chen, Zimu Wang, Yiyi Miao et al.
VideoPASTA: 7K Preference Pairs That Matter for Video-LLM Alignment
Yogesh Kulkarni, Pooyan Fazli
Do LLMs Adhere to Label Definitions? Examining Their Receptivity to External Label Definitions
Seyedali Mohammadi, Bhaskara Hanuma Vedula, Hemank Lamba et al.
The Impact of Language Mixing on Bilingual LLM Reasoning
Yihao Li, Jiayi Xin, Miranda Muqing Miao et al.
Batched Self-Consistency Improves LLM Relevance Assessment and Ranking
Anton Korikov, Pan Du, Scott Sanner et al.
Analyzing and Modeling LLM Response Lengths with Extreme Value Theory: Anchoring Effects and Hybrid Distributions
Liuxuan Jiao, Chen Gao, Yiqian Yang et al.
Benchmarking LLMs for Translating Classical Chinese Poetry: Evaluating Adequacy, Fluency, and Elegance
Andong Chen, Lianzhang Lou, Kehai Chen et al.
MemInsight: Autonomous Memory Augmentation for LLM Agents
Rana Salama, Jason Cai, Michelle Yuan et al.
No Need for Explanations: LLMs can implicitly learn from mistakes in-context
Lisa Alazraki, Maximilian Mozes, Jon Ander Campos et al.
Revealing and Mitigating the Challenge of Detecting Character Knowledge Errors in LLM Role-Playing
Wenyuan Zhang, Shuaiyi Nie, Jiawei Sheng et al.
Benchmarking LLMs on Semantic Overlap Summarization
John Salvador, Naman Bansal, Mousumi Akter et al.
ReflAct: World-Grounded Decision Making in LLM Agents via Goal-State Reflection
Jeonghye Kim, Sojeong Rhee, Minbeom Kim et al.
CompassVerifier: A Unified and Robust Verifier for LLMs Evaluation and Outcome Reward
Shudong Liu, Hongwei Liu, Junnan Liu et al.
A Knowledge-driven Adaptive Collaboration of LLMs for Enhancing Medical Decision-making
Xiao Wu, Ting-Zhu Huang, Liang-Jian Deng et al.
NESTFUL: A Benchmark for Evaluating LLMs on Nested Sequences of API Calls
Kinjal Basu, Ibrahim Abdelaziz, Kiran Kate et al.
DIWALI - Diversity and Inclusivity aWare cuLture specific Items for India: Dataset and Assessment of LLMs for Cultural Text Adaptation in Indian Context
Pramit Sahoo, Maharaj Brahma, Maunendra Sankar Desarkar
seqBench: A Tunable Benchmark to Quantify Sequential Reasoning Limits of LLMs
Mohammad Ramezanali, Mo Vazifeh, Paolo Santi
SATBench: Benchmarking LLMs’ Logical Reasoning via Automated Puzzle Generation from SAT Formulas
Anjiang Wei, Yuheng Wu, Yingjia Wan et al.
Personalized LLM Decoding via Contrasting Personal Preference
Hyungjune Bu, ChanJoo Jung, Minjae Kang et al.
MPCG: Multi-Round Persona-Conditioned Generation for Modeling the Evolution of Misinformation with LLMs
Chong Jun Rong Brian, Yixuan Tang, Anthony Kum Hoe Tung
Multi-LMentry: Can Multilingual LLMs Solve Elementary Tasks Across Languages?
Luca Moroni, Javier Aula-Blasco, Simone Conia et al.
EduAdapt: A Question Answer Benchmark Dataset for Evaluating Grade-Level Adaptability in LLMs
Numaan Naeem, Abdellah El Mekki, Muhammad Abdul-Mageed
NitiBench: Benchmarking LLM Frameworks on Thai Legal Question Answering Capabilities
Pawitsapak Akarajaradwong, Pirat Pothavorn, Chompakorn Chaksangchaichot et al.
Conflicting Needles in a Haystack: How LLMs behave when faced with contradictory information
Murathan Kurfali, Robert Östling