Evaluation

1654 directly classified papers

Papers

The Million Authors Corpus: A Cross-Lingual and Cross-Domain Wikipedia Dataset for Authorship Verification ACL 2025

LLM Evaluate: An Industry-Focused Evaluation Tool for Large Language Models COLING 2025

LIFBench: Evaluating the Instruction Following Performance and Stability of Large Language Models in Long-Context Scenarios ACL 2025

NormAd: A Framework for Measuring the Cultural Adaptability of Large Language Models NAACL 2025

LLM-as-an-Interviewer: Beyond Static Testing Through Dynamic LLM Evaluation ACL 2025

Why should only High-Resource-Languages have all the fun? Pivot Based Evaluation in Low Resource Setting COLING 2025

SPaRC: A Spatial Pathfinding Reasoning Challenge EMNLP 2025

Are Checklists Really Useful for Automatic Evaluation of Generative Tasks? EMNLP 2025

Beyond Context to Cognitive Appraisal: Emotion Reasoning as a Theory of Mind Benchmark for Large Language Models ACL 2025

What’s the most important value? INVP: INvestigating the Value Priorities of LLMs through Decision-making in Social Scenarios COLING 2025

Mis-prompt: Benchmarking Large Language Models for Proactive Error Handling ACL 2025

Examining False Positives under Inference Scaling for Mathematical Reasoning EMNLP 2025

LLMs are Biased Evaluators But Not Biased for Fact-Centric Retrieval Augmented Generation ACL 2025

Benchmarking Failures in Tool-Augmented Language Models NAACL 2025

Beyond Benchmarks: Building a Richer Cross-Document Event Coreference Dataset with Decontextualization NAACL 2025

Who Relies More on World Knowledge and Bias for Syntactic Ambiguity Resolution: Humans or LLMs? NAACL 2025

skLEP: A Slovak General Language Understanding Benchmark ACL 2025

Automated Model Evaluation for Object Detection via Prediction Consistency and Reliability ICCV 2025

Value Portrait: Assessing Language Models’ Values through Psychometrically and Ecologically Valid Items ACL 2025

Are We Done with MMLU? NAACL 2025

HATS: Hindi Analogy Test Set for Evaluating Reasoning in Large Language Models ACL 2025

ODP-Bench: Benchmarking Out-of-Distribution Performance Prediction ICCV 2025

Transparent Reference-free Automated Evaluation of Open-Ended User Survey Responses EMNLP 2025

GEMMAS: Graph-based Evaluation Metrics for Multi Agent Systems EMNLP 2025

How Reliable is Multilingual LLM-as-a-Judge? EMNLP 2025