Research Explorer
Machine Learning › Learning Types › Evaluation
1654 directly classified papers
Papers per year
2005: 1
2006: 1
2007: 1
2008: 2
2009: 1
2010: 3
2011: 2
2012: 3
2013: 5
2014: 4
2015: 4
2016: 11
2017: 19
2018: 32
2019: 39
2020: 72
2021: 110
2022: 202
2023: 222
2024: 351
2025: 569
Papers
Transferability Bound Theory: Exploring Relationship between Adversarial Transferability and Flatness
NIPS 2024
WMT24 Test Suite: Gender Resolution in Speaker-Listener Dialogue Roles
EMNLP 2024
Remember This Event That Year? Assessing Temporal Information and Understanding in Large Language Models
EMNLP 2024
Extrinsic Evaluation of Cultural Competence in Large Language Models
EMNLP 2024
Sonnet or Not, Bot? Poetry Evaluation for Large Models and Datasets
EMNLP 2024
Reshuffling Resampling Splits Can Improve Generalization of Hyperparameter Optimization
NIPS 2024
Are Large Language Models Consistent over Value-laden Questions?
EMNLP 2024
When “A Helpful Assistant” Is Not Really Helpful: Personas in System Prompts Do Not Improve Performances of Large Language Models
EMNLP 2024
LEGOBench: Scientific Leaderboard Generation Benchmark
EMNLP 2024
Factcheck-Bench: Fine-Grained Evaluation Benchmark for Automatic Fact-checkers
EMNLP 2024
PPTC-R benchmark: Towards Evaluating the Robustness of Large Language Models for PowerPoint Task Completion
EMNLP 2024
VeriScore: Evaluating the factuality of verifiable claims in long-form text generation
EMNLP 2024
PEDANTS: Cheap but Effective and Interpretable Answer Equivalence
EMNLP 2024
Can Language Models Recognize Convincing Arguments?
EMNLP 2024
The Potential and Challenges of Evaluating Attitudes, Opinions, and Values in Large Language Models
EMNLP 2024
Is Compound Aspect-Based Sentiment Analysis Addressed by LLMs?
EMNLP 2024
The Effect of Sampling Temperature on Problem Solving in Large Language Models
EMNLP 2024
SynthEval: Hybrid Behavioral Testing of NLP Models with Synthetic CheckLists
EMNLP 2024
From Generation to Selection: Findings of Converting Analogical Problem-Solving into Multiple-Choice Questions
EMNLP 2024
Large Language Models' Expert-level Global History Knowledge Benchmark (HiST-LLM)
NIPS 2024
bgGLUE: A Bulgarian General Language Understanding Evaluation Benchmark
ACL 2023
WRF: Weighted Rouge-F1 Metric for Entity Recognition
IJCNLP 2023
Delving into Evaluation Metrics for Generation: A Thorough Assessment of How Metrics Generalize to Rephrasing Across Languages
IJCNLP 2023
Which Shortcut Solution Do Question Answering Models Prefer to Learn?
AAAI 2023
Responsible AI Considerations in Text Summarization Research: A Review of Current Practices
EMNLP 2023