Research Explorer
Evaluation
1654 directly classified papers
Papers per year
2005: 1
2006: 1
2007: 1
2008: 2
2009: 1
2010: 3
2011: 2
2012: 3
2013: 5
2014: 4
2015: 4
2016: 11
2017: 19
2018: 32
2019: 39
2020: 72
2021: 110
2022: 202
2023: 222
2024: 351
2025: 569
Papers
Building a stable classifier with the inflated argmax
NIPS 2024
On the Worst Prompt Performance of Large Language Models
NIPS 2024
Safetywashing: Do AI Safety Benchmarks Actually Measure Safety Progress?
NIPS 2024
Zipper: Addressing Degeneracy in Algorithm-Agnostic Inference
NIPS 2024
Are Large Language Models Good Statisticians?
NIPS 2024
Are Language Models Actually Useful for Time Series Forecasting?
NIPS 2024
Reframing Spatial Reasoning Evaluation in Language Models: A Real-World Simulation Benchmark for Qualitative Reasoning
IJCAI 2024
MultifacetEval: Multifaceted Evaluation to Probe LLMs in Mastering Medical Knowledge
IJCAI 2024
NYU CTF Bench: A Scalable Open-Source Benchmark Dataset for Evaluating LLMs in Offensive Security
NIPS 2024
When Are Two Lists Better than One?: Benefits and Harms in Joint Decision-Making
AAAI 2024
PANORAMIA: Privacy Auditing of Machine Learning Models without Retraining
NIPS 2024
WelQrate: Defining the Gold Standard in Small Molecule Drug Discovery Benchmarking
NIPS 2024
Generative Interpretation: Toward Human-Like Evaluation for Educational Question-Answer Pair Generation
EACL 2024
When is a Metaphor Actually Novel? Annotating Metaphor Novelty in the Context of Automatic Metaphor Detection
EACL 2024
Are You Serious? Handling Disagreement When Annotating Conspiracy Theory Texts
EACL 2024
Donkii: Characterizing and Detecting Errors in Instruction-Tuning Datasets
EACL 2024
Can I trust You? LLMs as conversational agents
EACL 2024
Can Large Language Models Reason About Goal-Oriented Tasks?
EACL 2024
InstructEval: Towards Holistic Evaluation of Instruction-Tuned Large Language Models
EACL 2024
Order Effects in Annotation Tasks: Further Evidence of Annotation Sensitivity
EACL 2024
Different Tastes of Entities: Investigating Human Label Variation in Named Entity Annotations
EACL 2024
WenMind: A Comprehensive Benchmark for Evaluating Large Language Models in Chinese Classical Literature and Language Arts
NIPS 2024
SETLEXSEM CHALLENGE: Using Set Operations to Evaluate the Lexical and Semantic Robustness of Language Models
NIPS 2024
A Careful Examination of Large Language Model Performance on Grade School Arithmetic
NIPS 2024
Are More LLM Calls All You Need? Towards the Scaling Properties of Compound AI Systems
NIPS 2024