Evaluation

1654 directly classified papers

Papers

Transferability Bound Theory: Exploring Relationship between Adversarial Transferability and Flatness NIPS 2024

WMT24 Test Suite: Gender Resolution in Speaker-Listener Dialogue Roles EMNLP 2024

Remember This Event That Year? Assessing Temporal Information and Understanding in Large Language Models EMNLP 2024

Extrinsic Evaluation of Cultural Competence in Large Language Models EMNLP 2024

Sonnet or Not, Bot? Poetry Evaluation for Large Models and Datasets EMNLP 2024

Reshuffling Resampling Splits Can Improve Generalization of Hyperparameter Optimization NIPS 2024

Are Large Language Models Consistent over Value-laden Questions? EMNLP 2024

When “A Helpful Assistant” Is Not Really Helpful: Personas in System Prompts Do Not Improve Performances of Large Language Models EMNLP 2024

LEGOBench: Scientific Leaderboard Generation Benchmark EMNLP 2024

Factcheck-Bench: Fine-Grained Evaluation Benchmark for Automatic Fact-checkers EMNLP 2024

PPTC-R benchmark: Towards Evaluating the Robustness of Large Language Models for PowerPoint Task Completion EMNLP 2024

VeriScore: Evaluating the factuality of verifiable claims in long-form text generation EMNLP 2024

PEDANTS: Cheap but Effective and Interpretable Answer Equivalence EMNLP 2024

Can Language Models Recognize Convincing Arguments? EMNLP 2024

The Potential and Challenges of Evaluating Attitudes, Opinions, and Values in Large Language Models EMNLP 2024

Is Compound Aspect-Based Sentiment Analysis Addressed by LLMs? EMNLP 2024

The Effect of Sampling Temperature on Problem Solving in Large Language Models EMNLP 2024

SynthEval: Hybrid Behavioral Testing of NLP Models with Synthetic CheckLists EMNLP 2024

From Generation to Selection: Findings of Converting Analogical Problem-Solving into Multiple-Choice Questions EMNLP 2024

Large Language Models' Expert-level Global History Knowledge Benchmark (HiST-LLM) NIPS 2024

bgGLUE: A Bulgarian General Language Understanding Evaluation Benchmark ACL 2023

WRF: Weighted Rouge-F1 Metric for Entity Recognition IJCNLP 2023

Delving into Evaluation Metrics for Generation: A Thorough Assessment of How Metrics Generalize to Rephrasing Across Languages IJCNLP 2023

Which Shortcut Solution Do Question Answering Models Prefer to Learn? AAAI 2023

Responsible AI Considerations in Text Summarization Research: A Review of Current Practices EMNLP 2023