Evaluation

1654 directly classified papers

Papers

Efficient Lifelong Model Evaluation in an Era of Rapid Progress NIPS 2024

kGym: A Platform and Dataset to Benchmark Large Language Models on Linux Kernel Crash Resolution NIPS 2024

AlignBench: Benchmarking Chinese Alignment of Large Language Models ACL 2024

CriticBench: Benchmarking LLMs for Critique-Correct Reasoning ACL 2024

Likelihood-based Mitigation of Evaluation Bias in Large Language Models ACL 2024

An engine not a camera: Measuring performative power of online search NIPS 2024

Analyzing LLM Behavior in Dialogue Summarization: Unveiling Circumstantial Hallucination Trends ACL 2024

Conformalized Multiple Testing after Data-dependent Selection NIPS 2024

Language Models Do Hard Arithmetic Tasks Easily and Hardly Do Easy Arithmetic Tasks ACL 2024

Rethinking LLM Memorization through the Lens of Adversarial Compression NIPS 2024

ArabicMMLU: Assessing Massive Multitask Language Understanding in Arabic ACL 2024

Questioning the Survey Responses of Large Language Models NIPS 2024

An Empirical Analysis on Large Language Models in Debate Evaluation ACL 2024

MathBench: Evaluating the Theory and Application Proficiency of LLMs with a Hierarchical Mathematics Benchmark ACL 2024

Cocktail: A Comprehensive Information Retrieval Benchmark with LLM-Generated Documents Integration ACL 2024

Towards Human-AI Complementarity with Prediction Sets NIPS 2024

Automating Dataset Updates Towards Reliable and Timely Evaluation of Large Language Models NIPS 2024

TaskBench: Benchmarking Large Language Models for Task Automation NIPS 2024

NaturalCodeBench: Examining Coding Performance Mismatch on HumanEval and Natural User Queries ACL 2024

Fine-Tuned Machine Translation Metrics Struggle in Unseen Domains ACL 2024

SALAD-Bench: A Hierarchical and Comprehensive Safety Benchmark for Large Language Models ACL 2024

ProLex: A Benchmark for Language Proficiency-oriented Lexical Substitution ACL 2024

Rationales for Answers to Simple Math Word Problems Confuse Large Language Models ACL 2024

GPT is Not an Annotator: The Necessity of Human Annotation in Fairness Benchmark Construction ACL 2024

Quantifying Contamination in Evaluating Code Generation Capabilities of Language Models ACL 2024