Research Explorer
Evaluation
1654 directly classified papers
Papers per year
2005: 1
2006: 1
2007: 1
2008: 2
2009: 1
2010: 3
2011: 2
2012: 3
2013: 5
2014: 4
2015: 4
2016: 11
2017: 19
2018: 32
2019: 39
2020: 72
2021: 110
2022: 202
2023: 222
2024: 351
2025: 569
Papers
Weak-eval-Strong: Evaluating and Eliciting Lateral Thinking of LLMs with Situation Puzzles
NIPS 2024
On the Reliability of Psychological Scales on Large Language Models
EMNLP 2024
Replicability in Learning: Geometric Partitions and KKM-Sperner Lemma
NIPS 2024
Liar, Liar, Logical Mire: A Benchmark for Suppositional Reasoning in Large Language Models
EMNLP 2024
BLEnD: A Benchmark for LLMs on Everyday Knowledge in Diverse Cultures and Languages
NIPS 2024
Assessing “Implicit” Retrieval Robustness of Large Language Models
EMNLP 2024
Efficient Lifelong Model Evaluation in an Era of Rapid Progress
NIPS 2024
kGym: A Platform and Dataset to Benchmark Large Language Models on Linux Kernel Crash Resolution
NIPS 2024
Evaluating LLMs for Targeted Concept Simplification for Domain-Specific Texts
EMNLP 2024
An engine not a camera: Measuring performative power of online search
NIPS 2024
Conformalized Multiple Testing after Data-dependent Selection
NIPS 2024
Rethinking LLM Memorization through the Lens of Adversarial Compression
NIPS 2024
Questioning the Survey Responses of Large Language Models
NIPS 2024
Towards Human-AI Complementarity with Prediction Sets
NIPS 2024
Automating Dataset Updates Towards Reliable and Timely Evaluation of Large Language Models
NIPS 2024
TaskBench: Benchmarking Large Language Models for Task Automation
NIPS 2024
ECON: On the Detection and Resolution of Evidence Conflicts
EMNLP 2024
PUB: A Pragmatics Understanding Benchmark for Assessing LLMs’ Pragmatics Capabilities
ACL 2024
Generalization or Memorization: Data Contamination and Trustworthy Evaluation for Large Language Models
ACL 2024
Beyond Probabilities: Unveiling the Misalignment in Evaluating Large Language Models
ACL 2024
Measuring the Inconsistency of Large Language Models in Preferential Ranking
ACL 2024
MultifacetEval: Multifaceted Evaluation to Probe LLMs in Mastering Medical Knowledge
IJCAI 2024
Reframing Spatial Reasoning Evaluation in Language Models: A Real-World Simulation Benchmark for Qualitative Reasoning
IJCAI 2024
ACUEval: Fine-grained Hallucination Evaluation and Correction for Abstractive Summarization
ACL 2024
Data Contamination Calibration for Black-box LLMs
ACL 2024