Evaluation

1654 directly classified papers

Papers

Mercury: A Code Efficiency Benchmark for Code Large Language Models NIPS 2024

On the Rigour of Scientific Writing: Criteria, Analysis, and Insights EMNLP 2024

RepMatch: Quantifying Cross-Instance Similarities in Representation Space EMNLP 2024

Classifier Calibration with ROC-Regularized Isotonic Regression AISTATS 2024

Analyzing Explainer Robustness via Probabilistic Lipschitzness of Prediction Functions AISTATS 2024

NaturalCodeBench: Examining Coding Performance Mismatch on HumanEval and Natural User Queries ACL 2024

Auditing Local Explanations is Hard NIPS 2024

Categorical Generative Model Evaluation via Synthetic Distribution Coarsening AISTATS 2024

Understanding Generalization of Federated Learning via Stability: Heterogeneity Matters AISTATS 2024

QGEval: Benchmarking Multi-dimensional Evaluation for Question Generation EMNLP 2024

“A good pun is its own reword”: Can Large Language Models Understand Puns? EMNLP 2024

Overcoming Common Flaws in the Evaluation of Selective Classification Systems NIPS 2024

ToolBeHonest: A Multi-level Hallucination Diagnostic Benchmark for Tool-Augmented Large Language Models EMNLP 2024

Humans or LLMs as the Judge? A Study on Judgement Bias EMNLP 2024

Differentially Private Equivalence Testing for Continuous Distributions and Applications NIPS 2024

RepLiQA: A Question-Answering Dataset for Benchmarking LLMs on Unseen Reference Content NIPS 2024

On the Content Bias in Fréchet Video Distance CVPR 2024

Efficient Performance Tracking: Leveraging Large Language Models for Automated Construction of Scientific Leaderboards EMNLP 2024

LawBench: Benchmarking Legal Knowledge of Large Language Models EMNLP 2024

Large Language Models Fall Short: Understanding Complex Relationships in Detective Narratives ACL 2024

Every Answer Matters: Evaluating Commonsense with Probabilistic Measures ACL 2024

FOFO: A Benchmark to Evaluate LLMs’ Format-Following Capability ACL 2024

PARIKSHA: A Large-Scale Investigation of Human-LLM Evaluator Agreement on Multilingual and Multi-Cultural Data EMNLP 2024

Can Large Language Models Faithfully Express Their Intrinsic Uncertainty in Words? EMNLP 2024

InterrogateLLM: Zero-Resource Hallucination Detection in LLM-Generated Answers ACL 2024