Evaluation

1654 directly classified papers

Papers

Categorical Generative Model Evaluation via Synthetic Distribution Coarsening AISTATS 2024

Understanding Generalization of Federated Learning via Stability: Heterogeneity Matters AISTATS 2024

QGEval: Benchmarking Multi-dimensional Evaluation for Question Generation EMNLP 2024

“A good pun is its own reword”: Can Large Language Models Understand Puns? EMNLP 2024

Unexpected Phenomenon: LLMs’ Spurious Associations in Information Extraction ACL 2024

ToolBeHonest: A Multi-level Hallucination Diagnostic Benchmark for Tool-Augmented Large Language Models EMNLP 2024

Humans or LLMs as the Judge? A Study on Judgement Bias EMNLP 2024

Rationales for Answers to Simple Math Word Problems Confuse Large Language Models ACL 2024

Efficient Performance Tracking: Leveraging Large Language Models for Automated Construction of Scientific Leaderboards EMNLP 2024

LawBench: Benchmarking Legal Knowledge of Large Language Models EMNLP 2024

StudentEval: A Benchmark of Student-Written Prompts for Large Language Models of Code ACL 2024

Confidence Under the Hood: An Investigation into the Confidence-Probability Alignment in Large Language Models ACL 2024

Challenging Large Language Models with New Tasks: A Study on their Adaptability and Robustness ACL 2024

PARIKSHA: A Large-Scale Investigation of Human-LLM Evaluator Agreement on Multilingual and Multi-Cultural Data EMNLP 2024

Can Large Language Models Faithfully Express Their Intrinsic Uncertainty in Words? EMNLP 2024

ProLex: A Benchmark for Language Proficiency-oriented Lexical Substitution ACL 2024

Unveiling the Tapestry of Automated Essay Scoring: A Comprehensive Investigation of Accuracy, Fairness, and Generalizability AAAI 2024

Understanding and Mitigating Language Confusion in LLMs EMNLP 2024

UOUO: Uncontextualized Uncommon Objects for Measuring Knowledge Horizons of Vision Language Models EMNLP 2024

On the Importance of Application-Grounded Experimental Design for Evaluating Explainable ML Methods AAAI 2024

FOFO: A Benchmark to Evaluate LLMs’ Format-Following Capability ACL 2024

Fine-Grained Detection of Solidarity for Women and Migrants in 155 Years of German Parliamentary Debates EMNLP 2024

Moderate Message Passing Improves Calibration: A Universal Way to Mitigate Confidence Bias in Graph Neural Networks AAAI 2024

ToxiCloakCN: Evaluating Robustness of Offensive Language Detection in Chinese with Cloaking Perturbations EMNLP 2024

Every Answer Matters: Evaluating Commonsense with Probabilistic Measures ACL 2024