Research Explorer
Evaluation
1654 directly classified papers
Papers per year
2005: 1
2006: 1
2007: 1
2008: 2
2009: 1
2010: 3
2011: 2
2012: 3
2013: 5
2014: 4
2015: 4
2016: 11
2017: 19
2018: 32
2019: 39
2020: 72
2021: 110
2022: 202
2023: 222
2024: 351
2025: 569
Papers
CMoralEval: A Moral Evaluation Benchmark for Chinese Large Language Models
ACL 2024
Improving GNN Calibration with Discriminative Ability: Insights and Strategies
AAAI 2024
Uncertainty Quantification for Data-Driven Change-Point Learning via Cross-Validation
AAAI 2024
A linguistically-motivated evaluation methodology for unraveling model’s abilities in reading comprehension tasks
EMNLP 2024
Measuring the Inconsistency of Large Language Models in Preferential Ranking
ACL 2024
LatestEval: Addressing Data Contamination in Language Model Evaluation through Dynamic and Time-Sensitive Test Construction
AAAI 2024
Variable Importance in High-Dimensional Settings Requires Grouping
AAAI 2024
What Are the Odds? Language Models Are Capable of Probabilistic Reasoning
EMNLP 2024
Who Evaluates the Evaluations? Objectively Scoring Text-to-Image Prompt Coherence Metrics with T2IScoreScore (TS2)
NIPS 2024
Do LLMs Overcome Shortcut Learning? An Evaluation of Shortcut Challenges in Large Language Models
EMNLP 2024
Reasoning or a Semblance of it? A Diagnostic Study of Transitive Reasoning in LLMs
EMNLP 2024
Benchmarking Estimators for Natural Experiments: A Novel Dataset and a Doubly Robust Algorithm
NIPS 2024
Toward Conditional Distribution Calibration in Survival Prediction
NIPS 2024
What is lost in Normalization? Exploring Pitfalls in Multilingual ASR Model Evaluations
EMNLP 2024
MMTE: Corpus and Metrics for Evaluating Machine Translation Quality of Metaphorical Language
EMNLP 2024
SciEx: Benchmarking Large Language Models on Scientific Exams with Human Expert Grading and Automatic Grading
EMNLP 2024
Evaluating the Effectiveness of Large Language Models in Establishing Conversational Grounding
EMNLP 2024
DataTales: A Benchmark for Real-World Intelligent Data Narration
EMNLP 2024
MAIR: A Massive Benchmark for Evaluating Instructed Retrieval
EMNLP 2024
ConSiDERS-The-Human Evaluation Framework: Rethinking Human Evaluation for Generative Large Language Models
ACL 2024
TimeBench: A Comprehensive Evaluation of Temporal Reasoning Abilities in Large Language Models
ACL 2024
WaterBench: Towards Holistic Evaluation of Watermarks for Large Language Models
ACL 2024
AIR-Bench: Benchmarking Large Audio-Language Models via Generative Comprehension
ACL 2024
ValueBench: Towards Comprehensively Evaluating Value Orientations and Understanding of Large Language Models
ACL 2024
Perception of Knowledge Boundary for Large Language Models through Semi-open-ended Question Answering
NIPS 2024