Research Explorer
Evaluation
1654 directly classified papers
Papers per year
2005: 1
2006: 1
2007: 1
2008: 2
2009: 1
2010: 3
2011: 2
2012: 3
2013: 5
2014: 4
2015: 4
2016: 11
2017: 19
2018: 32
2019: 39
2020: 72
2021: 110
2022: 202
2023: 222
2024: 351
2025: 569
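As a quick consistency check, the yearly counts above can be summed and compared with the stated total of 1654 directly classified papers. The sketch below is only an illustration: the dictionary name `papers_per_year` and the derived quantities (`share_2025`, `growth_2024_to_2025`) are hypothetical names introduced here, not part of the site.

```python
# Yearly counts copied verbatim from the "Papers per year" list.
# The variable names below are illustrative assumptions, not site identifiers.
papers_per_year = {
    2005: 1, 2006: 1, 2007: 1, 2008: 2, 2009: 1, 2010: 3, 2011: 2,
    2012: 3, 2013: 5, 2014: 4, 2015: 4, 2016: 11, 2017: 19, 2018: 32,
    2019: 39, 2020: 72, 2021: 110, 2022: 202, 2023: 222, 2024: 351,
    2025: 569,
}

# Sum of all yearly counts; should match the stated total of 1654.
total = sum(papers_per_year.values())

# Fraction of all listed papers that appeared in 2025.
share_2025 = papers_per_year[2025] / total

# Ratio of the 2025 count to the 2024 count.
growth_2024_to_2025 = papers_per_year[2025] / papers_per_year[2024]

print(f"total papers: {total}")
print(f"2025 share of total: {share_2025:.1%}")
print(f"2025 / 2024 ratio: {growth_2024_to_2025:.2f}")
```

The yearly figures do add up to the stated total, and 2025 alone accounts for roughly a third of the listed papers.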
Papers
On A Scale From 1 to 5: Quantifying Hallucination in Faithfulness Evaluation
NAACL 2025
Hard Emotion Test Evaluation Sets for Language Models
NAACL 2025
UCL-Bench: A Chinese User-Centric Legal Benchmark for Large Language Models
NAACL 2025
Evaluating Numeracy of Language Models as a Natural Language Inference Task
NAACL 2025
Are Language Models Agnostic to Linguistically Grounded Perturbations? A Case Study of Indic Languages
NAACL 2025
Do LLMs Have Distinct and Consistent Personality? TRAIT: Personality Testset designed for LLMs with Psychometrics
NAACL 2025
Evaluating Evaluation Metrics for Ancient Chinese to English Machine Translation
NAACL 2025
Towards Region-aware Bias Evaluation Metrics
NAACL 2025
Assessing the Reliability and Validity of GPT-4 in Annotating Emotion Appraisal Ratings
NAACL 2025
Measuring Mental Health Variables in Computational Research: Toward Validated, Dimensional, and Transdiagnostic Approaches
NAACL 2025
An Analysis of Scoring Methods for Reranking in Large Language Model Story Generation
NAACL 2025
Does Training on Synthetic Data Make Models Less Robust?
NAACL 2025
Exploring Limitations of LLM Capabilities with Multi-Problem Evaluation
NAACL 2025
Evaluating Robustness of LLMs to Numerical Variations in Mathematical Reasoning
NAACL 2025
UTER: Capturing the Human Touch in Evaluating Morphologically Rich and Low-Resource Languages
NAACL 2025
Analyzing Large Language Models’ pastiche ability: a case study on a 20th century Romanian author
NAACL 2025
Line of Duty: Evaluating LLM Self-Knowledge via Consistency in Feasibility Boundaries
NAACL 2025
Difficulty Estimation in Natural Language Tasks with Action Scores
NAACL 2025
Defining and Quantifying Visual Hallucinations in Vision-Language Models
NAACL 2025
Revitalizing Saturated Benchmarks: A Weighted Metric Approach for Differentiating Large Language Model Performance
NAACL 2025
Investigating and Addressing Hallucinations of LLMs in Tasks Involving Negation
NAACL 2025
A Theoretical Framework for Evaluating Narrative Surprise in Large Language Models
NAACL 2025
CHATTER: A character-attribution dataset for narrative understanding
NAACL 2025
SimulatorArena: Are User Simulators Reliable Proxies for Multi-Turn Evaluation of AI Assistants?
EMNLP 2025
MCQFormatBench: Robustness Tests for Multiple-Choice Questions
ACL 2025