Research Explorer

LLMs Are Biased Towards Output Formats! Systematically Evaluating and Mitigating Output Format Bias of LLMs

Do Xuan Long, Ngoc-Hai Nguyen, Tiviatis Sim et al.

2025 NAACL

LLMs Are Not Intelligent Thinkers: Introducing Mathematical Topic Tree Benchmark for Comprehensive Evaluation of LLMs

Arash Gholami Davoodi, Seyed Pouyan Mousavi Davoudi, Pouya Pezeshkpour

2025 NAACL

ALPACA AGAINST VICUNA: Using LLMs to Uncover Memorization of LLMs

Aly M. Kassem, Omar Mahmoud, Niloofar Mireshghallah et al.

2025 NAACL

How Good Are LLMs for Literary Translation, Really? Literary Translation Evaluation with Humans and LLMs

Ran Zhang, Wei Zhao, Steffen Eger

2025 NAACL

Understanding LLM Development Through Longitudinal Study: Insights from the Open Ko-LLM Leaderboard

Chanjun Park, Hyeonwoo Kim

2025 NAACL

Do LLMs Have Distinct and Consistent Personality? TRAIT: Personality Testset designed for LLMs with Psychometrics

Seungbeen Lee, Seungwon Lim, Seungju Han et al.

2025 NAACL

LLM BiasScope: A Real-Time Bias Analysis Platform for Comparative LLM Evaluation

Himel Ghosh, Nick Elias Werner

2026 EACL

LLMs as Span Annotators: A Comparative Study of LLMs and Humans

Zdeněk Kasner, Vilém Zouhar, Patrícia Schmidtová et al.

2026 EACL

REACT-LLM: A Benchmark for Evaluating LLM Integration with Causal Features in Clinical Prognostic Tasks

Linna Wang, Zhixuan You, Qihui Zhang et al.

2026 AAAI

Towards Effective Offensive Security LLM Agents: Hyperparameter Tuning, LLM as a Judge, and a Lightweight CTF Benchmark

Minghao Shao, Nanda Rani, Kimberly Milner et al.

2026 AAAI

ExPairT-LLM: Exact Learning for LLM Code Selection by Pairwise Queries

Tom Yuviler, Dana Drachsler-Cohen

2026 AAAI

When the Domain Expert Has No Time and the LLM Developer Has No Clinical Expertise: Real-World Lessons from LLM Co-Design in a Safety-Net Hospital

Avni Kothari, Patrick Vossler, Jean Digitale et al.

2026 AAAI

OpenAGI: When LLM Meets Domain Experts

Yingqiang Ge, Wenyue Hua, Kai Mei et al.

2023 NIPS

QLoRA: Efficient Finetuning of Quantized LLMs

Tim Dettmers, Artidoro Pagnoni, Ari Holtzman et al.

2023 NIPS

VPGTrans: Transfer Visual Prompt Generator across LLMs

Ao Zhang, Hao Fei, Yuan Yao et al.

2023 NIPS

3D-LLM: Injecting the 3D World into Large Language Models

Yining Hong, Haoyu Zhen, Peihao Chen et al.

2023 NIPS

The Goldilocks of Pragmatic Understanding: Fine-Tuning Strategy Matters for Implicature Resolution by LLMs

Laura Ruis, Akbir Khan, Stella Biderman et al.

2023 NIPS

LLMScore: Unveiling the Power of Large Language Models in Text-to-Image Synthesis Evaluation

Yujie Lu, Xianjun Yang, Xiujun Li et al.

2023 NIPS

BeaverTails: Towards Improved Safety Alignment of LLM via a Human-Preference Dataset

Jiaming Ji, Mickel Liu, Josef Dai et al.

2023 NIPS

Free-Bloom: Zero-Shot Text-to-Video Generator with LLM Director and LDM Animator

Hanzhuo Huang, Yufan Feng, Cheng Shi et al.

2023 NIPS

Symbol-LLM: Leverage Language Models for Symbolic System in Visual Human Activity Reasoning

Xiaoqian Wu, Yong-Lu Li, Jianhua Sun et al.

2023 NIPS

Describe, Explain, Plan and Select: Interactive Planning with LLMs Enables Open-World Multi-Task Agents

Zihao Wang, Shaofei Cai, Guanzhou Chen et al.

2023 NIPS

Can LLM Already Serve as A Database Interface? A BIg Bench for Large-Scale Database Grounded Text-to-SQLs

Jinyang Li, Binyuan Hui, Ge Qu et al.

2023 NIPS

ToolQA: A Dataset for LLM Question Answering with External Tools

Yuchen Zhuang, Yue Yu, Kuan Wang et al.

2023 NIPS

Evaluating the Moral Beliefs Encoded in LLMs

Nino Scherrer, Claudia Shi, Amir Feder et al.

2023 NIPS

Papers