Evaluation

1654 directly classified papers

Papers

Can Large Language Models Win the International Mathematical Games? EMNLP 2025

Label Convergence: Defining an Upper Performance Bound in Object Recognition through Contradictory Annotations WACV 2025

Noise-Aware Evaluation of Object Detectors WACV 2025

Frame by Familiar Frame: Understanding Replication in Video Diffusion Models WACV 2025

A Rapid Test for Accuracy and Bias of Face Recognition Technology WACV 2025

Calibrating LLM Confidence by Probing Perturbed Representation Stability EMNLP 2025

LLMs cannot spot math errors, even when allowed to peek into the solution EMNLP 2025

REPA: Russian Error Types Annotation for Evaluating Text Generation and Judgment Capabilities ACL 2025

Long-Form Information Alignment Evaluation Beyond Atomic Facts EMNLP 2025

Benchmarking AI Text Detection: Assessing Detectors Against New Datasets, Evasion Tactics, and Enhanced LLMs COLING 2025

SSA-COMET: Do LLMs Outperform Learned Metrics in Evaluating MT for Under-Resourced African Languages? EMNLP 2025

Bridging AI and Carbon Capture: A Dataset for LLMs in Ionic Liquids and CBE Research ACL 2025

Towards Optimal Evaluation Efficiency for Large Language Models EMNLP 2025

Can LLMs Express Personality Across Cultures? Introducing CulturalPersonas for Evaluating Trait Alignment EMNLP 2025

Can LLMs Reason Abstractly Over Math Word Problems Without CoT? Disentangling Abstract Formulation From Arithmetic Computation EMNLP 2025

ARGENT: Automatic Reference-free Evaluation for Open-Ended Text Generation without Source Inputs ACL 2025

SSA: Semantic Contamination of LLM-Driven Fake News Detection EMNLP 2025

Out of the Box, into the Clinic? Evaluating State-of-the-Art ASR for Clinical Applications for Older Adults EMNLP 2025

DyePack: Provably Flagging Test Set Contamination in LLMs Using Backdoors EMNLP 2025

Are LLMs (Really) Ideological? An IRT-based Analysis and Alignment Tool for Perceived Socio-Economic Bias in LLMs ACL 2025

CheckEval: A reliable LLM-as-a-Judge framework for evaluating text generation using checklists EMNLP 2025

Reference-Guided Verdict: LLMs-as-Judges in Automatic Evaluation of Free-Form QA EMNLP 2025

Can LLMs simulate the same correct solutions to free-response math problems as real students? EMNLP 2025

Can LLMs Detect Intrinsic Hallucinations in Paraphrasing and Machine Translation? ACL 2025

RankedCOMET: Elevating a 2022 Baseline to a Top-5 Finish in the WMT 2025 QE Task EMNLP 2025