Research Explorer
Evaluation
1654 directly classified papers
Papers per year
2005: 1
2006: 1
2007: 1
2008: 2
2009: 1
2010: 3
2011: 2
2012: 3
2013: 5
2014: 4
2015: 4
2016: 11
2017: 19
2018: 32
2019: 39
2020: 72
2021: 110
2022: 202
2023: 222
2024: 351
2025: 569
Papers
Efficient Lifelong Model Evaluation in an Era of Rapid Progress
NIPS 2024
kGym: A Platform and Dataset to Benchmark Large Language Models on Linux Kernel Crash Resolution
NIPS 2024
BLEnD: A Benchmark for LLMs on Everyday Knowledge in Diverse Cultures and Languages
NIPS 2024
Replicability in Learning: Geometric Partitions and KKM-Sperner Lemma
NIPS 2024
Weak-eval-Strong: Evaluating and Eliciting Lateral Thinking of LLMs with Situation Puzzles
NIPS 2024
GenAI Arena: An Open Evaluation Platform for Generative Models
NIPS 2024
Long-form factuality in large language models
NIPS 2024
SWT-Bench: Testing and Validating Real-World Bug-Fixes with Code Agents
NIPS 2024
Benchmark Data Repositories for Better Benchmarking
NIPS 2024
IDGen: Item Discrimination Induced Prompt Generation for LLM Evaluation
NIPS 2024
ConStat: Performance-Based Contamination Detection in Large Language Models
NIPS 2024
Measuring Multimodal Mathematical Reasoning with MATH-Vision Dataset
NIPS 2024
TRAM: Benchmarking Temporal Reasoning for Large Language Models
ACL 2024
Unveiling Selection Biases: Exploring Order and Token Sensitivity in Large Language Models
ACL 2024
Language Models can Evaluate Themselves via Probability Discrepancy
ACL 2024
DebugBench: Evaluating Debugging Capability of Large Language Models
ACL 2024
DINGO: Towards Diverse and Fine-Grained Instruction-Following Evaluation
AAAI 2024
MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark
NIPS 2024
Perception of Knowledge Boundary for Large Language Models through Semi-open-ended Question Answering
NIPS 2024
Toward Conditional Distribution Calibration in Survival Prediction
NIPS 2024
Who Evaluates the Evaluations? Objectively Scoring Text-to-Image Prompt Coherence Metrics with T2IScoreScore (TS2)
NIPS 2024
Demonstrating Arena 3.0: Advancing Social Navigation in Collaborative and Highly Dynamic Environments
RSS 2024
Benchmarking Estimators for Natural Experiments: A Novel Dataset and a Doubly Robust Algorithm
NIPS 2024
Monoculture in Matching Markets
NIPS 2024
Testing Semantic Importance via Betting
NIPS 2024