Research Explorer
Evaluation
1654 directly classified papers
Papers per year
2005: 1
2006: 1
2007: 1
2008: 2
2009: 1
2010: 3
2011: 2
2012: 3
2013: 5
2014: 4
2015: 4
2016: 11
2017: 19
2018: 32
2019: 39
2020: 72
2021: 110
2022: 202
2023: 222
2024: 351
2025: 569
Papers
Can Large Language Models Win the International Mathematical Games?
EMNLP 2025
Label Convergence: Defining an Upper Performance Bound in Object Recognition through Contradictory Annotations
WACV 2025
Noise-Aware Evaluation of Object Detectors
WACV 2025
Frame by Familiar Frame: Understanding Replication in Video Diffusion Models
WACV 2025
A Rapid Test for Accuracy and Bias of Face Recognition Technology
WACV 2025
Calibrating LLM Confidence by Probing Perturbed Representation Stability
EMNLP 2025
LLMs cannot spot math errors, even when allowed to peek into the solution
EMNLP 2025
REPA: Russian Error Types Annotation for Evaluating Text Generation and Judgment Capabilities
ACL 2025
Long-Form Information Alignment Evaluation Beyond Atomic Facts
EMNLP 2025
Benchmarking AI Text Detection: Assessing Detectors Against New Datasets, Evasion Tactics, and Enhanced LLMs
COLING 2025
SSA-COMET: Do LLMs Outperform Learned Metrics in Evaluating MT for Under-Resourced African Languages?
EMNLP 2025
Bridging AI and Carbon Capture: A Dataset for LLMs in Ionic Liquids and CBE Research
ACL 2025
Towards Optimal Evaluation Efficiency for Large Language Models
EMNLP 2025
Can LLMs Express Personality Across Cultures? Introducing CulturalPersonas for Evaluating Trait Alignment
EMNLP 2025
Can LLMs Reason Abstractly Over Math Word Problems Without CoT? Disentangling Abstract Formulation From Arithmetic Computation
EMNLP 2025
ARGENT: Automatic Reference-free Evaluation for Open-Ended Text Generation without Source Inputs
ACL 2025
SSA: Semantic Contamination of LLM-Driven Fake News Detection
EMNLP 2025
Out of the Box, into the Clinic? Evaluating State-of-the-Art ASR for Clinical Applications for Older Adults
EMNLP 2025
DyePack: Provably Flagging Test Set Contamination in LLMs Using Backdoors
EMNLP 2025
Are LLMs (Really) Ideological? An IRT-based Analysis and Alignment Tool for Perceived Socio-Economic Bias in LLMs
ACL 2025
CheckEval: A reliable LLM-as-a-Judge framework for evaluating text generation using checklists
EMNLP 2025
Reference-Guided Verdict: LLMs-as-Judges in Automatic Evaluation of Free-Form QA
EMNLP 2025
Can LLMs simulate the same correct solutions to free-response math problems as real students?
EMNLP 2025
Can LLMs Detect Intrinsic Hallucinations in Paraphrasing and Machine Translation?
ACL 2025
RankedCOMET: Elevating a 2022 Baseline to a Top-5 Finish in the WMT 2025 QE Task
EMNLP 2025