Research Explorer
Evaluation
1654 directly classified papers
Papers per year
2005: 1
2006: 1
2007: 1
2008: 2
2009: 1
2010: 3
2011: 2
2012: 3
2013: 5
2014: 4
2015: 4
2016: 11
2017: 19
2018: 32
2019: 39
2020: 72
2021: 110
2022: 202
2023: 222
2024: 351
2025: 569
Papers
The Million Authors Corpus: A Cross-Lingual and Cross-Domain Wikipedia Dataset for Authorship Verification
ACL 2025
LLM Evaluate: An Industry-Focused Evaluation Tool for Large Language Models
COLING 2025
LIFBench: Evaluating the Instruction Following Performance and Stability of Large Language Models in Long-Context Scenarios
ACL 2025
NormAd: A Framework for Measuring the Cultural Adaptability of Large Language Models
NAACL 2025
LLM-as-an-Interviewer: Beyond Static Testing Through Dynamic LLM Evaluation
ACL 2025
Why should only High-Resource-Languages have all the fun? Pivot Based Evaluation in Low Resource Setting
COLING 2025
SPaRC: A Spatial Pathfinding Reasoning Challenge
EMNLP 2025
Are Checklists Really Useful for Automatic Evaluation of Generative Tasks?
EMNLP 2025
Beyond Context to Cognitive Appraisal: Emotion Reasoning as a Theory of Mind Benchmark for Large Language Models
ACL 2025
What’s the most important value? INVP: INvestigating the Value Priorities of LLMs through Decision-making in Social Scenarios
COLING 2025
Mis-prompt: Benchmarking Large Language Models for Proactive Error Handling
ACL 2025
Examining False Positives under Inference Scaling for Mathematical Reasoning
EMNLP 2025
LLMs are Biased Evaluators But Not Biased for Fact-Centric Retrieval Augmented Generation
ACL 2025
Benchmarking Failures in Tool-Augmented Language Models
NAACL 2025
Beyond Benchmarks: Building a Richer Cross-Document Event Coreference Dataset with Decontextualization
NAACL 2025
Who Relies More on World Knowledge and Bias for Syntactic Ambiguity Resolution: Humans or LLMs?
NAACL 2025
skLEP: A Slovak General Language Understanding Benchmark
ACL 2025
Automated Model Evaluation for Object Detection via Prediction Consistency and Reliability
ICCV 2025
Value Portrait: Assessing Language Models’ Values through Psychometrically and Ecologically Valid Items
ACL 2025
Are We Done with MMLU?
NAACL 2025
HATS : Hindi Analogy Test Set for Evaluating Reasoning in Large Language Models
ACL 2025
ODP-Bench: Benchmarking Out-of-Distribution Performance Prediction
ICCV 2025
Transparent Reference-free Automated Evaluation of Open-Ended User Survey Responses
EMNLP 2025
GEMMAS: Graph-based Evaluation Metrics for Multi Agent Systems
EMNLP 2025
How Reliable is Multilingual LLM-as-a-Judge?
EMNLP 2025