Research Explorer
Evaluation
1654 directly classified papers
Papers per year
2005: 1
2006: 1
2007: 1
2008: 2
2009: 1
2010: 3
2011: 2
2012: 3
2013: 5
2014: 4
2015: 4
2016: 11
2017: 19
2018: 32
2019: 39
2020: 72
2021: 110
2022: 202
2023: 222
2024: 351
2025: 569
Papers
ArabicSense: A Benchmark for Evaluating Commonsense Reasoning in Arabic with Large Language Models
COLING 2025
LogiDynamics: Unraveling the Dynamics of Inductive, Abductive and Deductive Logical Inferences in LLM Reasoning
EMNLP 2025
Memorization ≠ Understanding: Do Large Language Models Have the Ability of Scenario Cognition?
EMNLP 2025
CoreEval: Automatically Building Contamination-Resilient Datasets with Real-World Knowledge toward Reliable LLM Evaluation
ACL 2025
The Emperor’s New Reasoning: Format Imitation Overshadows Genuine Mathematical Understanding in SFT
EMNLP 2025
ReproHum #0031-01: Reproducing the Human Evaluation of Readability from “It is AI’s Turn to Ask Humans a Question”
ACL 2025
Memorization or Reasoning? Exploring the Idiom Understanding of LLMs
EMNLP 2025
Evaluating Evaluation Metrics for Ancient Chinese to English Machine Translation
NAACL 2025
From Understanding to Generation: An Efficient Shortcut for Evaluating Language Models
EMNLP 2025
ReproHum #0033-05: Human Evaluation of Factuality from A Multidisciplinary Perspective
ACL 2025
Transitive self-consistency evaluation of NLI models without gold labels
EMNLP 2025
Towards Unifying Evaluation of Counterfactual Explanations: Leveraging Large Language Models for Human-Centric Assessments
AAAI 2025
Persona-Augmented Benchmarking: Evaluating LLMs Across Diverse Writing Styles
EMNLP 2025
ReproHum #0669-08: Reproducing Sentiment Transfer Evaluation
ACL 2025
DCR: Quantifying Data Contamination in LLMs Evaluation
EMNLP 2025
A Simple and Comprehensive Benchmark for Single-Cell Transcriptomics
AAAI 2025
Drivel-ology: Challenging LLMs with Interpreting Nonsense with Depth
EMNLP 2025
Towards Better Open-Ended Text Generation: A Multicriteria Evaluation Framework
ACL 2025
Agent-as-Judge for Factual Summarization of Long Narratives
EMNLP 2025
Thinking Racial Bias in Fair Forgery Detection: Models, Datasets and Evaluations
AAAI 2025
Scalable and Culturally Specific Stereotype Dataset Construction via Human-LLM Collaboration
EMNLP 2025
ArithmAttack: Evaluating Robustness of LLMs to Noisy Context in Math Problem Solving
ACL 2025
Towards a Holistic and Automated Evaluation Framework for Multi-Level Comprehension of LLMs in Book-Length Contexts
EMNLP 2025
EditBoard: Towards a Comprehensive Evaluation Benchmark for Text-Based Video Editing Models
AAAI 2025
PhysReason: A Comprehensive Benchmark towards Physics-Based Reasoning
ACL 2025