Evaluation

1654 directly classified papers

Papers

DiQAD: A Benchmark Dataset for Open-domain Dialogue Quality Assessment EMNLP 2023

Exploring the Numerical Reasoning Capabilities of Language Models: A Comprehensive Analysis on Tabular Data EMNLP 2023

Give Me the Facts! A Survey on Factual Knowledge Probing in Pre-trained Language Models EMNLP 2023

FFAEval: Evaluating Dialogue System via Free-For-All Ranking EMNLP 2023

The Linearity of the Effect of Surprisal on Reading Times across Languages EMNLP 2023

Measuring Pointwise 𝒱-Usable Information In-Context-ly EMNLP 2023

Arabic Dialect Identification under Scrutiny: Limitations of Single-label Classification EMNLP 2023

Scalable Membership Inference Attacks via Quantile Regression NIPS 2023

Generalized test utilities for long-tail performance in extreme multi-label classification NIPS 2023

Calibration by Distribution Matching: Trainable Kernel Calibration Metrics NIPS 2023

Mathematical Capabilities of ChatGPT NIPS 2023

Mass-Producing Failures of Multimodal Systems with Language Models NIPS 2023

Statistical Knowledge Assessment for Large Language Models NIPS 2023

Feature Likelihood Divergence: Evaluating the Generalization of Generative Models Using Samples NIPS 2023

GNNEvaluator: Evaluating GNN Performance On Unseen Graphs Without Labels NIPS 2023

FELM: Benchmarking Factuality Evaluation of Large Language Models NIPS 2023

Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena NIPS 2023

Layer or Representation Space: What Makes BERT-based Evaluation Metrics Robust? COLING 2022

Imbalance-Aware Uplift Modeling for Observational Data AAAI 2022

When AI Difficulty Is Easy: The Explanatory Power of Predicting IRT Difficulty AAAI 2022

Play the Shannon Game with Language Models: A Human-Free Approach to Summary Evaluation AAAI 2022

The King Is Naked: On the Notion of Robustness for Natural Language Processing AAAI 2022

A Simulation-Based Evaluation Framework for Interactive AI Systems and Its Application AAAI 2022

Using Sampling to Estimate and Improve Performance of Automated Scoring Systems with Guarantees AAAI 2022

Local Structure Matters Most in Most Languages AACL 2022