Papers
RiddleBench: A New Generative Reasoning Benchmark for LLMs
Deepon Halder, Alan Saji, Thanmay Jayakumar et al.
ExpressivityBench: Can LLMs Communicate Implicitly?
Joshua Tint, Som Sagar, Aditya Taparia et al.
SpARK: An Embarrassingly Simple Sparse Watermarking in LLMs with Enhanced Text Quality
Duy Cao Hoang, Thanh Quoc Hung Le, Rui Chu et al.
UniToolBench: A Benchmark for Tool-Augmented LLMs in Cross-Domain, Universal Task Automation
Xiaojie Guo, Yang Zhang, Bing Zhang et al.
Thunder-NUBench: A Benchmark for LLMs’ Sentence-Level Negation Understanding
Yeonkyoung So, Gyuseong Lee, Sungmok Jung et al.
What Makes a Good Query? Measuring the Impact of Human-Confusing Linguistic Features on LLM Performance
William Watson, Nicole Cho, Sumitra Ganesh et al.
Program-of-Thought Reveals LLM Abstraction Ceilings
Mike Zhou, Fenil Bardoliya, Vivek Gupta et al.
Show or Tell? Modeling the evolution of request-making in Human-LLM conversations
Shengqi Zhu, Jeffrey Rzeszotarski, David Mimno
Sparse Brains are Also Adaptive Brains: Cognitive-Load-Aware Dynamic Activation for LLMs
Yiheng Yang, Yujie Wang, Chi Ma et al.
DuFFin: A Dual-Level Fingerprinting Framework for LLMs IP Protection
Yuliang Yan, Haochun Tang, Shuo Yan et al.
Let’s Simplify Step by Step: Guiding LLM Towards Multilingual Unsupervised Proficiency-Controlled Sentence Simplification
Jingshen Zhang, Xin Ying Qiu, Lifang Lu et al.
LogToP: Logic Tree-of-Program with Table Instruction-tuned LLMs for Controlled Logical Table-to-Text Generation
Yupian Lin, Guangya Yu, Cheng Yuan et al.
DFPE: A Diverse Fingerprint Ensemble for Enhancing LLM Performance
Seffi Cohen, Nurit Cohen Inger, Niv Goldshlager et al.
Ranking Human and LLM Texts Using Locality Statistics
Yiyang Wang, Chen Ding, Hangfeng He
Improving the OOD Performance of Closed-Source LLMs on NLI Through Strategic Data Selection
Joe Stacey, Lisa Alazraki, Aran Ubhi et al.
Do LLMs model human linguistic variation? A case study in Hindi-English Verb code-mixing
Mukund Choudhary, Madhur Jindal, Gaurja Aeron et al.
FactSelfCheck: Fact-Level Black-Box Hallucination Detection for LLMs
Albert Sawczyn, Jakub Binkowski, Denis Janiak et al.
What Matters to an LLM? Behavioral and Computational Evidences from Summarization
Yongxin Zhou, Changshun Wu, Philippe Mulhem et al.
Better Call CLAUSE: A Discrepancy Benchmark for Auditing LLMs Legal Reasoning Capabilities
Manan Roy Choudhury, Adithya Chandramouli, Mannan Anand et al.
Beyond a Single Extractor: Re-thinking HTML-to-Text Extraction for LLM Pre-training
Jeffrey Li, Joshua P Gardner, Doug Kang et al.
Can Models Help Us Create Better Models? Evaluating LLMs as Data Scientists
Michał Pietruszka, Łukasz Borchmann, Aleksander Jędrosz et al.
Argument-Based Consistency in Toxicity Explanations of LLMs
Ramaravind Kommiya Mothilal, Joanna Roy, Syed Ishtiaque Ahmed et al.
Quantifying Data Contamination in Psychometric Evaluations of LLMs
Jongwook Han, Woojung Song, Jonggeun Lee et al.
How to Contextualize Empirical Data for Risk Analysis with LLMs: A Case Study of Power Outages
Haiyun Huang, Yukun Li, Marco A Pretell et al.
Thinking Beyond the Local: Multi-View Instructed Adaptive Reasoning in KG-Enhanced LLMs
Minghan Zhang, Shu Zhao, Zhen Yang et al.