Research Explorer
Evaluation
1654 directly classified papers
Papers per year
2005: 1
2006: 1
2007: 1
2008: 2
2009: 1
2010: 3
2011: 2
2012: 3
2013: 5
2014: 4
2015: 4
2016: 11
2017: 19
2018: 32
2019: 39
2020: 72
2021: 110
2022: 202
2023: 222
2024: 351
2025: 569
Papers
DiQAD: A Benchmark Dataset for Open-domain Dialogue Quality Assessment
EMNLP 2023
Exploring the Numerical Reasoning Capabilities of Language Models: A Comprehensive Analysis on Tabular Data
EMNLP 2023
Give Me the Facts! A Survey on Factual Knowledge Probing in Pre-trained Language Models
EMNLP 2023
FFAEval: Evaluating Dialogue System via Free-For-All Ranking
EMNLP 2023
The Linearity of the Effect of Surprisal on Reading Times across Languages
EMNLP 2023
Measuring Pointwise 𝒱-Usable Information In-Context-ly
EMNLP 2023
Arabic Dialect Identification under Scrutiny: Limitations of Single-label Classification
EMNLP 2023
Scalable Membership Inference Attacks via Quantile Regression
NIPS 2023
Generalized test utilities for long-tail performance in extreme multi-label classification
NIPS 2023
Calibration by Distribution Matching: Trainable Kernel Calibration Metrics
NIPS 2023
Mathematical Capabilities of ChatGPT
NIPS 2023
Mass-Producing Failures of Multimodal Systems with Language Models
NIPS 2023
Statistical Knowledge Assessment for Large Language Models
NIPS 2023
Feature Likelihood Divergence: Evaluating the Generalization of Generative Models Using Samples
NIPS 2023
GNNEvaluator: Evaluating GNN Performance On Unseen Graphs Without Labels
NIPS 2023
FELM: Benchmarking Factuality Evaluation of Large Language Models
NIPS 2023
Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena
NIPS 2023
Layer or Representation Space: What Makes BERT-based Evaluation Metrics Robust?
COLING 2022
Imbalance-Aware Uplift Modeling for Observational Data
AAAI 2022
When AI Difficulty Is Easy: The Explanatory Power of Predicting IRT Difficulty
AAAI 2022
Play the Shannon Game with Language Models: A Human-Free Approach to Summary Evaluation
AAAI 2022
The King Is Naked: On the Notion of Robustness for Natural Language Processing
AAAI 2022
A Simulation-Based Evaluation Framework for Interactive AI Systems and Its Application
AAAI 2022
Using Sampling to Estimate and Improve Performance of Automated Scoring Systems with Guarantees
AAAI 2022
Local Structure Matters Most in Most Languages
AACL 2022