language model evaluation

221 papers

Also known as

LLM EVALUATION; LME; LLM; LMS

Co-occurring keywords

large language model (12755); language model (4573); benchmark evaluation (1539); multilingual nlp (1423); natural language understanding (845); text generation (2903); low-resource language (2234); evaluation benchmark (250); text classification (6776); question answering (2904)

Papers

HumanRankEval: Automatic Evaluation of LMs as Conversational Assistants NAACL 2024

IndiBias: A Benchmark Dataset to Measure Social Biases in Language Models for Indian Context NAACL 2024

GeminiPro at SemEval-2024 Task 9: BrainTeaser on Gemini NAACL 2024

SemEval-2024 Task 2: Safe Biomedical Natural Language Inference for Clinical Trials NAACL 2024

A Closer Look at Claim Decomposition NAACL 2024

Discovering Language Model Behaviors with Model-Written Evaluations ACL 2023

bgGLUE: A Bulgarian General Language Understanding Evaluation Benchmark ACL 2023

Evaluating the Rationale Understanding of Critical Reasoning in Logical Reading Comprehension EMNLP 2023

ChatGPT Beyond English: Towards a Comprehensive Evaluation of Large Language Models in Multilingual Learning EMNLP 2023

Can Language Models Be Specific? How? ACL 2023

LeXFiles and LegalLAMA: Facilitating English Multinational Legal Language Model Development ACL 2023

DUMB: A Benchmark for Smart Evaluation of Dutch Models EMNLP 2023

MedEval: A Multi-Level, Multi-Task, and Multi-Domain Medical Benchmark for Language Model Evaluation EMNLP 2023

FActScore: Fine-grained Atomic Evaluation of Factual Precision in Long Form Text Generation EMNLP 2023

Building Real-World Meeting Summarization Systems using Large Language Models: A Practical Perspective EMNLP 2023

Is ChatGPT a Financial Expert? Evaluating Language Models on Financial Natural Language Processing EMNLP 2023

Pseudointelligence: A Unifying Lens on Language Model Evaluation EMNLP 2023

SuperTweetEval: A Challenging, Unified and Heterogeneous Benchmark for Social Media NLP Research EMNLP 2023

Elo Uncovered: Robustness and Best Practices in Language Model Evaluation EMNLP 2023

Retrieval-based Evaluation for LLMs: A Case Study in Korean Legal QA EMNLP 2023

Transparency at the Source: Evaluating and Interpreting Language Models With Access to the True Distribution EMNLP 2023

OpenICL: An Open-Source Framework for In-context Learning ACL 2023

Jamp: Controlled Japanese Temporal Inference Dataset for Evaluating Generalization Capacity of Language Models ACL 2023

UINAUIL: A Unified Benchmark for Italian Natural Language Understanding ACL 2023

Language models are not naysayers: an analysis of language models on negation benchmarks ACL 2023