Research Explorer
Evaluation
515 directly classified papers
Papers per year
2003: 1
2004: 1
2005: 1
2006: 1
2008: 2
2009: 1
2010: 1
2013: 5
2016: 3
2017: 8
2018: 11
2019: 24
2020: 25
2021: 34
2022: 68
2023: 74
2024: 105
2025: 147
2026: 3
Papers
CriticEval: Evaluating Large-scale Language Model as Critic
NIPS 2024
Compact Proofs of Model Performance via Mechanistic Interpretability
NIPS 2024
Navigating the Maze of Explainable AI: A Systematic Approach to Evaluating Methods and Metrics
NIPS 2024
Safetywashing: Do AI Safety Benchmarks Actually Measure Safety Progress?
NIPS 2024
Uncertainty in Language Models: Assessment through Rank-Calibration
EMNLP 2024
Extrinsic Evaluation of Machine Translation Metrics
ACL 2023
NLG Evaluation Metrics Beyond Correlation Analysis: An Empirical Metric Preference Checklist
ACL 2023
Rogue Scores
ACL 2023
On the Blind Spots of Model-Based Evaluation Metrics for Text Generation
ACL 2023
Non-Repeatable Experiments and Non-Reproducible Results: The Reproducibility Crisis in Human Evaluation in NLP
ACL 2023
TextVerifier: Robustness Verification for Textual Classifiers with Certifiable Guarantees
ACL 2023
On the Limitations of Simulating Active Learning
ACL 2023
A Call for Standardization and Validation of Text Style Transfer Evaluation
ACL 2023
Data Sampling and (In)stability in Machine Translation Evaluation
ACL 2023
Scalable and Explainable Automated Scoring for Open-Ended Constructed Response Math Word Problems
ACL 2023
C-PMI: Conditional Pointwise Mutual Information for Turn-level Dialogue Evaluation
ACL 2023
Lexical Repetitions Lead to Rote Learning: Unveiling the Impact of Lexical Overlap in Train and Test Reference Summaries
EMNLP 2023
Blackbird language matrices (BLM), a new task for rule-like generalization in neural networks: Can Large Language Models pass the test?
EMNLP 2023
Zero-Shot Data Maps. Efficient Dataset Cartography Without Model Training
EMNLP 2023
Is GPT-4 a Good Data Analyst?
EMNLP 2023
Estimating Large Language Model Capabilities without Labeled Test Data
EMNLP 2023
On the Calibration of Large Language Models and Alignment
EMNLP 2023
Decoding Stumpers: Large Language Models vs. Human Problem-Solvers
EMNLP 2023
IAEval: A Comprehensive Evaluation of Instance Attribution on Natural Language Understanding
EMNLP 2023
SuperTweetEval: A Challenging, Unified and Heterogeneous Benchmark for Social Media NLP Research
EMNLP 2023