Evaluation

1654 directly classified papers

Papers

ArabicSense: A Benchmark for Evaluating Commonsense Reasoning in Arabic with Large Language Models COLING 2025

LogiDynamics: Unraveling the Dynamics of Inductive, Abductive and Deductive Logical Inferences in LLM Reasoning EMNLP 2025

Memorization ≠ Understanding: Do Large Language Models Have the Ability of Scenario Cognition? EMNLP 2025

CoreEval: Automatically Building Contamination-Resilient Datasets with Real-World Knowledge toward Reliable LLM Evaluation ACL 2025

The Emperor’s New Reasoning: Format Imitation Overshadows Genuine Mathematical Understanding in SFT EMNLP 2025

ReproHum #0031-01: Reproducing the Human Evaluation of Readability from “It is AI’s Turn to Ask Humans a Question” ACL 2025

Memorization or Reasoning? Exploring the Idiom Understanding of LLMs EMNLP 2025

Evaluating Evaluation Metrics for Ancient Chinese to English Machine Translation NAACL 2025

From Understanding to Generation: An Efficient Shortcut for Evaluating Language Models EMNLP 2025

ReproHum #0033-05: Human Evaluation of Factuality from A Multidisciplinary Perspective ACL 2025

Transitive self-consistency evaluation of NLI models without gold labels EMNLP 2025

Towards Unifying Evaluation of Counterfactual Explanations: Leveraging Large Language Models for Human-Centric Assessments AAAI 2025

Persona-Augmented Benchmarking: Evaluating LLMs Across Diverse Writing Styles EMNLP 2025

ReproHum #0669-08: Reproducing Sentiment Transfer Evaluation ACL 2025

DCR: Quantifying Data Contamination in LLMs Evaluation EMNLP 2025

A Simple and Comprehensive Benchmark for Single-Cell Transcriptomics AAAI 2025

Drivel-ology: Challenging LLMs with Interpreting Nonsense with Depth EMNLP 2025

Towards Better Open-Ended Text Generation: A Multicriteria Evaluation Framework ACL 2025

Agent-as-Judge for Factual Summarization of Long Narratives EMNLP 2025

Thinking Racial Bias in Fair Forgery Detection: Models, Datasets and Evaluations AAAI 2025

Scalable and Culturally Specific Stereotype Dataset Construction via Human-LLM Collaboration EMNLP 2025

ArithmAttack: Evaluating Robustness of LLMs to Noisy Context in Math Problem Solving ACL 2025

Towards a Holistic and Automated Evaluation Framework for Multi-Level Comprehension of LLMs in Book-Length Contexts EMNLP 2025

EditBoard: Towards a Comprehensive Evaluation Benchmark for Text-Based Video Editing Models AAAI 2025

PhysReason: A Comprehensive Benchmark towards Physics-Based Reasoning ACL 2025