Research Explorer
Evaluation
1654 directly classified papers
Papers per year
2005: 1
2006: 1
2007: 1
2008: 2
2009: 1
2010: 3
2011: 2
2012: 3
2013: 5
2014: 4
2015: 4
2016: 11
2017: 19
2018: 32
2019: 39
2020: 72
2021: 110
2022: 202
2023: 222
2024: 351
2025: 569
Papers
Categorical Generative Model Evaluation via Synthetic Distribution Coarsening
AISTATS 2024
Understanding Generalization of Federated Learning via Stability: Heterogeneity Matters
AISTATS 2024
QGEval: Benchmarking Multi-dimensional Evaluation for Question Generation
EMNLP 2024
“A good pun is its own reword”: Can Large Language Models Understand Puns?
EMNLP 2024
Unexpected Phenomenon: LLMs’ Spurious Associations in Information Extraction
ACL 2024
ToolBeHonest: A Multi-level Hallucination Diagnostic Benchmark for Tool-Augmented Large Language Models
EMNLP 2024
Humans or LLMs as the Judge? A Study on Judgement Bias
EMNLP 2024
Rationales for Answers to Simple Math Word Problems Confuse Large Language Models
ACL 2024
Efficient Performance Tracking: Leveraging Large Language Models for Automated Construction of Scientific Leaderboards
EMNLP 2024
LawBench: Benchmarking Legal Knowledge of Large Language Models
EMNLP 2024
StudentEval: A Benchmark of Student-Written Prompts for Large Language Models of Code
ACL 2024
Confidence Under the Hood: An Investigation into the Confidence-Probability Alignment in Large Language Models
ACL 2024
Challenging Large Language Models with New Tasks: A Study on their Adaptability and Robustness
ACL 2024
PARIKSHA: A Large-Scale Investigation of Human-LLM Evaluator Agreement on Multilingual and Multi-Cultural Data
EMNLP 2024
Can Large Language Models Faithfully Express Their Intrinsic Uncertainty in Words?
EMNLP 2024
ProLex: A Benchmark for Language Proficiency-oriented Lexical Substitution
ACL 2024
Unveiling the Tapestry of Automated Essay Scoring: A Comprehensive Investigation of Accuracy, Fairness, and Generalizability
AAAI 2024
Understanding and Mitigating Language Confusion in LLMs
EMNLP 2024
UOUO: Uncontextualized Uncommon Objects for Measuring Knowledge Horizons of Vision Language Models
EMNLP 2024
On the Importance of Application-Grounded Experimental Design for Evaluating Explainable ML Methods
AAAI 2024
FOFO: A Benchmark to Evaluate LLMs’ Format-Following Capability
ACL 2024
Fine-Grained Detection of Solidarity for Women and Migrants in 155 Years of German Parliamentary Debates
EMNLP 2024
Moderate Message Passing Improves Calibration: A Universal Way to Mitigate Confidence Bias in Graph Neural Networks
AAAI 2024
ToxiCloakCN: Evaluating Robustness of Offensive Language Detection in Chinese with Cloaking Perturbations
EMNLP 2024
Every Answer Matters: Evaluating Commonsense with Probabilistic Measures
ACL 2024