Research Explorer
Evaluation
1654 directly classified papers
Papers per year
2005: 1
2006: 1
2007: 1
2008: 2
2009: 1
2010: 3
2011: 2
2012: 3
2013: 5
2014: 4
2015: 4
2016: 11
2017: 19
2018: 32
2019: 39
2020: 72
2021: 110
2022: 202
2023: 222
2024: 351
2025: 569
Papers
ECON: On the Detection and Resolution of Evidence Conflicts
EMNLP 2024
Do Large Language Models Know How Much They Know?
EMNLP 2024
I am a Strange Dataset: Metalinguistic Tests for Language Models
ACL 2024
On the Reliability of Psychological Scales on Large Language Models
EMNLP 2024
CliMedBench: A Large-Scale Chinese Benchmark for Evaluating Medical Large Language Models in Clinical Scenarios
EMNLP 2024
Monoculture in Matching Markets
NIPS 2024
PANORAMIA: Privacy Auditing of Machine Learning Models without Retraining
NIPS 2024
An Analysis of Multilingual FActScore
EMNLP 2024
NYU CTF Bench: A Scalable Open-Source Benchmark Dataset for Evaluating LLMs in Offensive Security
NIPS 2024
When Reasoning Meets Information Aggregation: A Case Study with Sports Narratives
EMNLP 2024
Prometheus 2: An Open Source Language Model Specialized in Evaluating Other Language Models
EMNLP 2024
WelQrate: Defining the Gold Standard in Small Molecule Drug Discovery Benchmarking
NIPS 2024
FOOL ME IF YOU CAN! An Adversarial Dataset to Investigate the Robustness of LMs in Word Sense Disambiguation
EMNLP 2024
Beyond Reference: Evaluating High Quality Translations Better than Human References
EMNLP 2024
Are Language Models Actually Useful for Time Series Forecasting?
NIPS 2024
Different Tastes of Entities: Investigating Human Label Variation in Named Entity Annotations
EACL 2024
Mobile-Bench: An Evaluation Benchmark for LLM-based Mobile Agents
ACL 2024
An Audit on the Perspectives and Challenges of Hallucinations in NLP
EMNLP 2024
Efficient LLM Comparative Assessment: A Product of Experts Framework for Pairwise Comparisons
EMNLP 2024
Order Effects in Annotation Tasks: Further Evidence of Annotation Sensitivity
EACL 2024
Assessing “Implicit” Retrieval Robustness of Large Language Models
EMNLP 2024
LUQ: Long-text Uncertainty Quantification for LLMs
EMNLP 2024
Is It Good Data for Multilingual Instruction Tuning or Just Bad Multilingual Evaluation for Large Language Models?
EMNLP 2024
Pretraining Data Detection for Large Language Models: A Divergence-based Calibration Method
EMNLP 2024
STOP! Benchmarking Large Language Models with Sensitivity Testing on Offensive Progressions
EMNLP 2024