Research Explorer
Evaluation
1654 directly classified papers
Papers per year
2005: 1
2006: 1
2007: 1
2008: 2
2009: 1
2010: 3
2011: 2
2012: 3
2013: 5
2014: 4
2015: 4
2016: 11
2017: 19
2018: 32
2019: 39
2020: 72
2021: 110
2022: 202
2023: 222
2024: 351
2025: 569
Papers
Only a Little to the Left: A Theory-grounded Measure of Political Bias in Large Language Models
ACL 2025
A Tale of Evaluating Factual Consistency: Case Study on Long Document Summarization Evaluation
ACL 2025
Difficulty Estimation in Natural Language Tasks with Action Scores
NAACL 2025
Can External Validation Tools Improve Annotation Quality for LLM-as-a-Judge?
ACL 2025
Line of Duty: Evaluating LLM Self-Knowledge via Consistency in Feasibility Boundaries
NAACL 2025
Not quite Sherlock Holmes: Language model predictions do not reliably differentiate impossible from improbable events
ACL 2025
Defining and Quantifying Visual Hallucinations in Vision-Language Models
NAACL 2025
Judging the Judges: Can Large Vision-Language Models Fairly Evaluate Chart Comprehension and Reasoning?
ACL 2025
BI-Bench: A Comprehensive Benchmark Dataset and Unsupervised Evaluation for BI Systems
ACL 2025
Worse than Random? An Embarrassingly Simple Probing Evaluation of Large Multimodal Models in Medical VQA
ACL 2025
“Stupid robot, I want to speak to a human!” User Frustration Detection in Task-Oriented Dialog Systems
COLING 2025
The Alternative Annotator Test for LLM-as-a-Judge: How to Statistically Justify Replacing Human Annotators with LLMs
ACL 2025
A Theoretical Framework for Evaluating Narrative Surprise in Large Language Models
NAACL 2025
LADDER: Language-Driven Slice Discovery and Error Rectification in Vision Classifiers
ACL 2025
SEA-HELM: Southeast Asian Holistic Evaluation of Language Models
ACL 2025
CapArena: Benchmarking and Analyzing Detailed Image Captioning in the LLM Era
ACL 2025
A Large and Balanced Corpus for Fine-grained Arabic Readability Assessment
ACL 2025
REPRO-Bench: Can Agentic AI Systems Assess the Reproducibility of Social Science Research?
ACL 2025
Label Convergence: Defining an Upper Performance Bound in Object Recognition through Contradictory Annotations
WACV 2025
Com2: A Causal-Guided Benchmark for Exploring Complex Commonsense Reasoning in Large Language Models
ACL 2025
Lost in Transcription, Found in Distribution Shift: Demystifying Hallucination in Speech Foundation Models
ACL 2025
Human Bias in the Face of AI: Examining Human Judgment Against Text Labeled as AI Generated
ACL 2025
Noise-Aware Evaluation of Object Detectors
WACV 2025
Frame by Familiar Frame: Understanding Replication in Video Diffusion Models
WACV 2025
Demystifying Synthetic Data in LLM Pre-training: A Systematic Study of Scaling Laws, Benefits, and Pitfalls
EMNLP 2025