Research Explorer
Evaluation
1654 directly classified papers
Papers per year
2005: 1
2006: 1
2007: 1
2008: 2
2009: 1
2010: 3
2011: 2
2012: 3
2013: 5
2014: 4
2015: 4
2016: 11
2017: 19
2018: 32
2019: 39
2020: 72
2021: 110
2022: 202
2023: 222
2024: 351
2025: 569
Papers
On the Content Bias in Frechet Video Distance
CVPR 2024
On the Faithfulness of Vision Transformer Explanations
CVPR 2024
A Toolbox for Modelling Engagement with Educational Videos
AAAI 2024
Better than Random: Reliable NLG Human Evaluation with Constrained Active Sampling
AAAI 2024
LatestEval: Addressing Data Contamination in Language Model Evaluation through Dynamic and Time-Sensitive Test Construction
AAAI 2024
A Call to Reflect on Evaluation Practices for Age Estimation: Comparative Analysis of the State-of-the-Art and a Unified Benchmark
CVPR 2024
The Mirrored Influence Hypothesis: Efficient Data Influence Estimation by Harnessing Forward Passes
CVPR 2024
Can Biases in ImageNet Models Explain Generalization?
CVPR 2024
CausalBench: A Comprehensive Benchmark for Evaluating Causal Reasoning Capabilities of Large Language Models
ACL 2024
Measuring the Inconsistency of Large Language Models in Preferential Ranking
ACL 2024
Beyond Probabilities: Unveiling the Misalignment in Evaluating Large Language Models
ACL 2024
PUB: A Pragmatics Understanding Benchmark for Assessing LLMs’ Pragmatics Capabilities
ACL 2024
Generalization or Memorization: Data Contamination and Trustworthy Evaluation for Large Language Models
ACL 2024
Towards Multiple References Era – Addressing Data Leakage and Limited Reference Diversity in Machine Translation Evaluation
ACL 2024
CMoralEval: A Moral Evaluation Benchmark for Chinese Large Language Models
ACL 2024
Data Contamination Calibration for Black-box LLMs
ACL 2024
ACUEval: Fine-grained Hallucination Evaluation and Correction for Abstractive Summarization
ACL 2024
Are LLMs Capable of Data-based Statistical and Causal Reasoning? Benchmarking Advanced Quantitative Reasoning with Data
ACL 2024
TaskBench: Benchmarking Large Language Models for Task Automation
NIPS 2024
Automating Dataset Updates Towards Reliable and Timely Evaluation of Large Language Models
NIPS 2024
Towards Human-AI Complementarity with Prediction Sets
NIPS 2024
Questioning the Survey Responses of Large Language Models
NIPS 2024
Rethinking LLM Memorization through the Lens of Adversarial Compression
NIPS 2024
Conformalized Multiple Testing after Data-dependent Selection
NIPS 2024
An engine not a camera: Measuring performative power of online search
NIPS 2024