Research Explorer
Evaluation
1654 directly classified papers
Papers per year
2005: 1
2006: 1
2007: 1
2008: 2
2009: 1
2010: 3
2011: 2
2012: 3
2013: 5
2014: 4
2015: 4
2016: 11
2017: 19
2018: 32
2019: 39
2020: 72
2021: 110
2022: 202
2023: 222
2024: 351
2025: 569
Papers
Efficient Lifelong Model Evaluation in an Era of Rapid Progress
NIPS 2024
kGym: A Platform and Dataset to Benchmark Large Language Models on Linux Kernel Crash Resolution
NIPS 2024
AlignBench: Benchmarking Chinese Alignment of Large Language Models
ACL 2024
CriticBench: Benchmarking LLMs for Critique-Correct Reasoning
ACL 2024
Likelihood-based Mitigation of Evaluation Bias in Large Language Models
ACL 2024
An engine not a camera: Measuring performative power of online search
NIPS 2024
Analyzing LLM Behavior in Dialogue Summarization: Unveiling Circumstantial Hallucination Trends
ACL 2024
Conformalized Multiple Testing after Data-dependent Selection
NIPS 2024
Language Models Do Hard Arithmetic Tasks Easily and Hardly Do Easy Arithmetic Tasks
ACL 2024
Rethinking LLM Memorization through the Lens of Adversarial Compression
NIPS 2024
ArabicMMLU: Assessing Massive Multitask Language Understanding in Arabic
ACL 2024
Questioning the Survey Responses of Large Language Models
NIPS 2024
An Empirical Analysis on Large Language Models in Debate Evaluation
ACL 2024
MathBench: Evaluating the Theory and Application Proficiency of LLMs with a Hierarchical Mathematics Benchmark
ACL 2024
Cocktail: A Comprehensive Information Retrieval Benchmark with LLM-Generated Documents Integration
ACL 2024
Towards Human-AI Complementarity with Prediction Sets
NIPS 2024
Automating Dataset Updates Towards Reliable and Timely Evaluation of Large Language Models
NIPS 2024
TaskBench: Benchmarking Large Language Models for Task Automation
NIPS 2024
NaturalCodeBench: Examining Coding Performance Mismatch on HumanEval and Natural User Queries
ACL 2024
Fine-Tuned Machine Translation Metrics Struggle in Unseen Domains
ACL 2024
SALAD-Bench: A Hierarchical and Comprehensive Safety Benchmark for Large Language Models
ACL 2024
ProLex: A Benchmark for Language Proficiency-oriented Lexical Substitution
ACL 2024
Rationales for Answers to Simple Math Word Problems Confuse Large Language Models
ACL 2024
GPT is Not an Annotator: The Necessity of Human Annotation in Fairness Benchmark Construction
ACL 2024
Quantifying Contamination in Evaluating Code Generation Capabilities of Language Models
ACL 2024