Research Explorer
Evaluation
1654 directly classified papers
Papers per year
2005: 1
2006: 1
2007: 1
2008: 2
2009: 1
2010: 3
2011: 2
2012: 3
2013: 5
2014: 4
2015: 4
2016: 11
2017: 19
2018: 32
2019: 39
2020: 72
2021: 110
2022: 202
2023: 222
2024: 351
2025: 569
Papers
ODP-Bench: Benchmarking Out-of-Distribution Performance Prediction
ICCV 2025
Graders Should Cheat: Privileged Information Enables Expert-Level Automated Evaluations
EMNLP 2025
Are Bias Evaluation Methods Biased ?
ACL 2025
Are Large Language Models Chronically Online Surfers? A Dataset for Chinese Internet Meme Explanation
EMNLP 2025
What’s the most important value? INVP: INvestigating the Value Priorities of LLMs through Decision-making in Social Scenarios
COLING 2025
BasqBBQ: A QA Benchmark for Assessing Social Biases in LLMs for Basque, a Low-Resource Language
COLING 2025
Why should only High-Resource-Languages have all the fun? Pivot Based Evaluation in Low Resource Setting
COLING 2025
No Size Fits All: The Perils and Pitfalls of Leveraging LLMs Vary with Company Size
COLING 2025
Is my Meeting Summary Good? Estimating Quality with a Multi-LLM Evaluator
COLING 2025
neDIOM: Dataset and Analysis of Nepali Idioms
COLING 2025
Can AI Make Us Laugh? Comparing Jokes Generated by Witscript and a Human Expert
COLING 2025
On Crowdsourcing Task Design for Discourse Relation Annotation
COLING 2025
Sources of Disagreement in Data for LLM Instruction Tuning
COLING 2025
Evaluating Financial Literacy of Large Language Models through Domain Specific Languages for Plain Text Accounting
COLING 2025
AveniBench: Accessible and Versatile Evaluation of Finance Intelligence
COLING 2025
FinNLP-FNP-LLMFinLegal-2025 Shared Task: Financial Misinformation Detection Challenge Task
COLING 2025
Benchmarking AI Text Detection: Assessing Detectors Against New Datasets, Evasion Tactics, and Enhanced LLMs
COLING 2025
GPT-4 is Judged More Human than Humans in Displaced and Inverted Turing Tests
COLING 2025
Evaluating Structural and Linguistic Quality in Urdu DRS Parsing and Generation through Bidirectional Evaluation
COLING 2025
CaLQuest.PT: Towards the Collection and Evaluation of Natural Causal Ladder Questions in Portuguese for AI Agents
COLING 2025
Towards Inclusive Arabic LLMs: A Culturally Aligned Benchmark in Arabic Large Language Model Evaluation
COLING 2025
Controlled Evaluation of Syntactic Knowledge in Multilingual Language Models
COLING 2025
The First Workshop on Multilingual Counterspeech Generation at COLING 2025: Overview of the Shared Task
COLING 2025
Thai Winograd Schemas: A Benchmark for Thai Commonsense Reasoning
COLING 2025
UTBoost: Rigorous Evaluation of Coding Agents on SWE-Bench
ACL 2025