Research Explorer
Evaluation
1654 directly classified papers
Papers per year
2005: 1
2006: 1
2007: 1
2008: 2
2009: 1
2010: 3
2011: 2
2012: 3
2013: 5
2014: 4
2015: 4
2016: 11
2017: 19
2018: 32
2019: 39
2020: 72
2021: 110
2022: 202
2023: 222
2024: 351
2025: 569
Papers
Multi-Dimensional Machine Translation Evaluation: Model Evaluation and Resource for Korean
COLING 2024
MoZIP: A Multilingual Benchmark to Evaluate Large Language Models in Intellectual Property
COLING 2024
LongDocFACTScore: Evaluating the Factuality of Long Document Abstractive Summarisation
COLING 2024
KoDialogBench: Evaluating Conversational Understanding of Language Models with Korean Dialogue Benchmark
COLING 2024
JCoLA: Japanese Corpus of Linguistic Acceptability
COLING 2024
Is Summary Useful or Not? An Extrinsic Human Evaluation of Text Summaries on Downstream Tasks
COLING 2024
Is LLM a Reliable Reviewer? A Comprehensive Evaluation of LLM on Automatic Paper Reviewing Tasks
COLING 2024
Large Language Models Fall Short: Understanding Complex Relationships in Detective Narratives
ACL 2024
The State of Relation Extraction Data Quality: Is Bigger Always Better?
ACL 2024
“My Answer is C”: First-Token Probabilities Do Not Match Text Answers in Instruction-Tuned Language Models
ACL 2024
NaturalCodeBench: Examining Coding Performance Mismatch on HumanEval and Natural User Queries
ACL 2024
MathBench: Evaluating the Theory and Application Proficiency of LLMs with a Hierarchical Mathematics Benchmark
ACL 2024
Cocktail: A Comprehensive Information Retrieval Benchmark with LLM-Generated Documents Integration
ACL 2024
Challenging Large Language Models with New Tasks: A Study on their Adaptability and Robustness
ACL 2024
StudentEval: A Benchmark of Student-Written Prompts for Large Language Models of Code
ACL 2024
Rationales for Answers to Simple Math Word Problems Confuse Large Language Models
ACL 2024
Speech vs. Transcript: Does It Matter for Human Annotators in Speech Summarization?
ACL 2024
Understanding the Effects of Noise in Text-to-SQL: An Examination of the BIRD-Bench Benchmark
ACL 2024
Analyzing LLM Behavior in Dialogue Summarization: Unveiling Circumstantial Hallucination Trends
ACL 2024
SeeGULL Multilingual: a Dataset of Geo-Culturally Situated Stereotypes
ACL 2024
Uncertainty Quantification for Data-Driven Change-Point Learning via Cross-Validation
AAAI 2024
Improving GNN Calibration with Discriminative Ability: Insights and Strategies
AAAI 2024
A General Model for Aggregating Annotations Across Simple, Complex, and Multi-object Annotation Tasks (Abstract Reprint)
AAAI 2024
ArabicMMLU: Assessing Massive Multitask Language Understanding in Arabic
ACL 2024
An Empirical Investigation Into Benchmarking Model Multiplicity for Trustworthy Machine Learning: A Case Study on Image Classification
WACV 2024