Research Explorer
Evaluation
1654 directly classified papers
Papers per year
2005: 1
2006: 1
2007: 1
2008: 2
2009: 1
2010: 3
2011: 2
2012: 3
2013: 5
2014: 4
2015: 4
2016: 11
2017: 19
2018: 32
2019: 39
2020: 72
2021: 110
2022: 202
2023: 222
2024: 351
2025: 569
Papers
Mercury: A Code Efficiency Benchmark for Code Large Language Models
NeurIPS 2024
On the Rigour of Scientific Writing: Criteria, Analysis, and Insights
EMNLP 2024
RepMatch: Quantifying Cross-Instance Similarities in Representation Space
EMNLP 2024
Classifier Calibration with ROC-Regularized Isotonic Regression
AISTATS 2024
Analyzing Explainer Robustness via Probabilistic Lipschitzness of Prediction Functions
AISTATS 2024
NaturalCodeBench: Examining Coding Performance Mismatch on HumanEval and Natural User Queries
ACL 2024
Auditing Local Explanations is Hard
NeurIPS 2024
Categorical Generative Model Evaluation via Synthetic Distribution Coarsening
AISTATS 2024
Understanding Generalization of Federated Learning via Stability: Heterogeneity Matters
AISTATS 2024
QGEval: Benchmarking Multi-dimensional Evaluation for Question Generation
EMNLP 2024
“A good pun is its own reword”: Can Large Language Models Understand Puns?
EMNLP 2024
Overcoming Common Flaws in the Evaluation of Selective Classification Systems
NeurIPS 2024
ToolBeHonest: A Multi-level Hallucination Diagnostic Benchmark for Tool-Augmented Large Language Models
EMNLP 2024
Humans or LLMs as the Judge? A Study on Judgement Bias
EMNLP 2024
Differentially Private Equivalence Testing for Continuous Distributions and Applications
NeurIPS 2024
RepLiQA: A Question-Answering Dataset for Benchmarking LLMs on Unseen Reference Content
NeurIPS 2024
On the Content Bias in Fréchet Video Distance
CVPR 2024
Efficient Performance Tracking: Leveraging Large Language Models for Automated Construction of Scientific Leaderboards
EMNLP 2024
LawBench: Benchmarking Legal Knowledge of Large Language Models
EMNLP 2024
Large Language Models Fall Short: Understanding Complex Relationships in Detective Narratives
ACL 2024
Every Answer Matters: Evaluating Commonsense with Probabilistic Measures
ACL 2024
FOFO: A Benchmark to Evaluate LLMs’ Format-Following Capability
ACL 2024
PARIKSHA: A Large-Scale Investigation of Human-LLM Evaluator Agreement on Multilingual and Multi-Cultural Data
EMNLP 2024
Can Large Language Models Faithfully Express Their Intrinsic Uncertainty in Words?
EMNLP 2024
InterrogateLLM: Zero-Resource Hallucination Detection in LLM-Generated Answers
ACL 2024