Evaluation

515 directly classified papers

Papers

CriticEval: Evaluating Large-scale Language Model as Critic NIPS 2024

Compact Proofs of Model Performance via Mechanistic Interpretability NIPS 2024

Navigating the Maze of Explainable AI: A Systematic Approach to Evaluating Methods and Metrics NIPS 2024

Safetywashing: Do AI Safety Benchmarks Actually Measure Safety Progress? NIPS 2024

Uncertainty in Language Models: Assessment through Rank-Calibration EMNLP 2024

Extrinsic Evaluation of Machine Translation Metrics ACL 2023

NLG Evaluation Metrics Beyond Correlation Analysis: An Empirical Metric Preference Checklist ACL 2023

Rogue Scores ACL 2023

On the Blind Spots of Model-Based Evaluation Metrics for Text Generation ACL 2023

Non-Repeatable Experiments and Non-Reproducible Results: The Reproducibility Crisis in Human Evaluation in NLP ACL 2023

TextVerifier: Robustness Verification for Textual Classifiers with Certifiable Guarantees ACL 2023

On the Limitations of Simulating Active Learning ACL 2023

A Call for Standardization and Validation of Text Style Transfer Evaluation ACL 2023

Data Sampling and (In)stability in Machine Translation Evaluation ACL 2023

Scalable and Explainable Automated Scoring for Open-Ended Constructed Response Math Word Problems ACL 2023

C-PMI: Conditional Pointwise Mutual Information for Turn-level Dialogue Evaluation ACL 2023

Lexical Repetitions Lead to Rote Learning: Unveiling the Impact of Lexical Overlap in Train and Test Reference Summaries EMNLP 2023

Blackbird language matrices (BLM), a new task for rule-like generalization in neural networks: Can Large Language Models pass the test? EMNLP 2023

Zero-Shot Data Maps. Efficient Dataset Cartography Without Model Training EMNLP 2023

Is GPT-4 a Good Data Analyst? EMNLP 2023

Estimating Large Language Model Capabilities without Labeled Test Data EMNLP 2023

On the Calibration of Large Language Models and Alignment EMNLP 2023

Decoding Stumpers: Large Language Models vs. Human Problem-Solvers EMNLP 2023

IAEval: A Comprehensive Evaluation of Instance Attribution on Natural Language Understanding EMNLP 2023

SuperTweetEval: A Challenging, Unified and Heterogeneous Benchmark for Social Media NLP Research EMNLP 2023