Evaluation

1654 directly classified papers

Papers

VCSearch: Bridging the Gap Between Well-Defined and Ill-Defined Problems in Mathematical Reasoning EMNLP 2025

AgentDiagnose: An Open Toolkit for Diagnosing LLM Agent Trajectories EMNLP 2025

QG-SMS: Enhancing Test Item Analysis via Student Modeling and Simulation ACL 2025

PromptSuite: A Task-Agnostic Framework for Multi-Prompt Generation EMNLP 2025

EditBoard: Towards a Comprehensive Evaluation Benchmark for Text-Based Video Editing Models AAAI 2025

o-MEGA: Optimized Methods for Explanation Generation and Analysis EMNLP 2025

ODP-Bench: Benchmarking Out-of-Distribution Performance Prediction ICCV 2025

TruthTorchLM: A Comprehensive Library for Predicting Truthfulness in LLM Outputs EMNLP 2025

Membership Inference Attacks with False Discovery Rate Control ICCV 2025

SAGE: A Generic Framework for LLM Safety Evaluation EMNLP 2025

InductionBench: LLMs Fail in the Simplest Complexity Class ACL 2025

Truth, Trust, and Trouble: Medical AI on the Edge EMNLP 2025

An Empirical Study of Position Bias in Modern Information Retrieval EMNLP 2025

InstaJudge: Aligning Judgment Bias of LLM-as-Judge with Humans in Industry Applications EMNLP 2025

Conflicts in Texts: Data, Implications and Challenges EMNLP 2025

From Feedback to Checklists: Grounded Evaluation of AI-Generated Clinical Notes EMNLP 2025

HighMATH: Evaluating Math Reasoning of Large Language Models in Breadth and Depth EMNLP 2025

Time Is Effort: Estimating Human Post-Editing Time for Grammar Error Correction Tool Evaluation EMNLP 2025

Only a Little to the Left: A Theory-grounded Measure of Political Bias in Large Language Models ACL 2025

Towards Robust Universal Information Extraction: Dataset, Evaluation, and Solution ACL 2025

Do Language Models Understand the Cognitive Tasks Given to Them? Investigations with the N-Back Paradigm ACL 2025

Theory of Mind in Large Language Models: Assessment and Enhancement ACL 2025

When Models Lie, We Learn: Multilingual Span-Level Hallucination Detection with PsiloQA EMNLP 2025

SpeechIQ: Speech-Agentic Intelligence Quotient Across Cognitive Levels in Voice Understanding by Large Language Models ACL 2025

MedFact: A Large-scale Chinese Dataset for Evidence-based Medical Fact-checking of LLM Responses EMNLP 2025