Evaluation

1654 directly classified papers

Papers

ODP-Bench: Benchmarking Out-of-Distribution Performance Prediction ICCV 2025

Graders Should Cheat: Privileged Information Enables Expert-Level Automated Evaluations EMNLP 2025

Are Bias Evaluation Methods Biased? ACL 2025

Are Large Language Models Chronically Online Surfers? A Dataset for Chinese Internet Meme Explanation EMNLP 2025

What’s the most important value? INVP: INvestigating the Value Priorities of LLMs through Decision-making in Social Scenarios COLING 2025

BasqBBQ: A QA Benchmark for Assessing Social Biases in LLMs for Basque, a Low-Resource Language COLING 2025

Why should only High-Resource-Languages have all the fun? Pivot Based Evaluation in Low Resource Setting COLING 2025

No Size Fits All: The Perils and Pitfalls of Leveraging LLMs Vary with Company Size COLING 2025

Is my Meeting Summary Good? Estimating Quality with a Multi-LLM Evaluator COLING 2025

neDIOM: Dataset and Analysis of Nepali Idioms COLING 2025

Can AI Make Us Laugh? Comparing Jokes Generated by Witscript and a Human Expert COLING 2025

On Crowdsourcing Task Design for Discourse Relation Annotation COLING 2025

Sources of Disagreement in Data for LLM Instruction Tuning COLING 2025

Evaluating Financial Literacy of Large Language Models through Domain Specific Languages for Plain Text Accounting COLING 2025

AveniBench: Accessible and Versatile Evaluation of Finance Intelligence COLING 2025

FinNLP-FNP-LLMFinLegal-2025 Shared Task: Financial Misinformation Detection Challenge Task COLING 2025

Benchmarking AI Text Detection: Assessing Detectors Against New Datasets, Evasion Tactics, and Enhanced LLMs COLING 2025

GPT-4 is Judged More Human than Humans in Displaced and Inverted Turing Tests COLING 2025

Evaluating Structural and Linguistic Quality in Urdu DRS Parsing and Generation through Bidirectional Evaluation COLING 2025

CaLQuest.PT: Towards the Collection and Evaluation of Natural Causal Ladder Questions in Portuguese for AI Agents COLING 2025

Towards Inclusive Arabic LLMs: A Culturally Aligned Benchmark in Arabic Large Language Model Evaluation COLING 2025

Controlled Evaluation of Syntactic Knowledge in Multilingual Language Models COLING 2025

The First Workshop on Multilingual Counterspeech Generation at COLING 2025: Overview of the Shared Task COLING 2025

Thai Winograd Schemas: A Benchmark for Thai Commonsense Reasoning COLING 2025

UTBoost: Rigorous Evaluation of Coding Agents on SWE-Bench ACL 2025