Research Explorer
Evaluation
1654 directly classified papers
Papers per year
2005: 1
2006: 1
2007: 1
2008: 2
2009: 1
2010: 3
2011: 2
2012: 3
2013: 5
2014: 4
2015: 4
2016: 11
2017: 19
2018: 32
2019: 39
2020: 72
2021: 110
2022: 202
2023: 222
2024: 351
2025: 569
Papers
Can AI Make Us Laugh? Comparing Jokes Generated by Witscript and a Human Expert
COLING 2025
Do not Abstain! Identify and Solve the Uncertainty
ACL 2025
ReliableEval: A Recipe for Stochastic LLM Evaluation via Method of Moments
EMNLP 2025
LLM-based post-editing as reference-free GEC evaluation
ACL 2025
Are Language Models Agnostic to Linguistically Grounded Perturbations? A Case Study of Indic Languages
NAACL 2025
InMind: Evaluating LLMs in Capturing and Applying Individual Human Reasoning Styles
EMNLP 2025
Evaluating Numeracy of Language Models as a Natural Language Inference Task
NAACL 2025
RCScore: Quantifying Response Consistency in Large Language Models
EMNLP 2025
Do LLMs Give Psychometrically Plausible Responses in Educational Assessments?
ACL 2025
OmniEval: An Omnidirectional and Automatic RAG Evaluation Benchmark in Financial Domain
EMNLP 2025
Semantic Masking in a Needle-in-a-haystack Test for Evaluating Large Language Model Long-Text Capabilities
COLING 2025
Arena-lite: Efficient and Reliable Large Language Model Evaluation via Tournament-Based Direct Comparisons
EMNLP 2025
Don’t Score too Early! Evaluating Argument Mining Models on Incomplete Essays
ACL 2025
NOVA-63: Native Omni-lingual Versatile Assessments of 63 Disciplines
EMNLP 2025
MultiConIR: Towards Multi-Condition Information Retrieval
EMNLP 2025
Confounding Factors in Relating Model Performance to Morphology
EMNLP 2025
Comparing Behavioral Patterns of LLM and Human Tutors: A Population-level Analysis with the CIMA Dataset
ACL 2025
UNCERTAINTY-LINE: Length-Invariant Estimation of Uncertainty for Large Language Models
EMNLP 2025
Why should only High-Resource-Languages have all the fun? Pivot Based Evaluation in Low Resource Setting
COLING 2025
PoSum-Bench: Benchmarking Position Bias in LLM-based Conversational Summarization
EMNLP 2025
Findings of the BEA 2025 Shared Task on Pedagogical Ability Assessment of AI-powered Tutors
ACL 2025
We Need to Measure Data Diversity in NLP — Better and Broader
EMNLP 2025
Evaluating Robustness of LLMs to Numerical Variations in Mathematical Reasoning
NAACL 2025
FB-Bench: A Fine-Grained Multi-Task Benchmark for Evaluating LLMs’ Responsiveness to Human Feedback
EMNLP 2025
Stop Playing the Guessing Game! Evaluating Conversational Recommender Systems via Target-free User Simulation
EMNLP 2025