Evaluation

1654 directly classified papers

Papers

“A good pun is its own reword”: Can Large Language Models Understand Puns? EMNLP 2024

ToolBeHonest: A Multi-level Hallucination Diagnostic Benchmark for Tool-Augmented Large Language Models EMNLP 2024

Humans or LLMs as the Judge? A Study on Judgement Bias EMNLP 2024

Efficient Performance Tracking: Leveraging Large Language Models for Automated Construction of Scientific Leaderboards EMNLP 2024

LawBench: Benchmarking Legal Knowledge of Large Language Models EMNLP 2024

F-Eval: Assessing Fundamental Abilities with Refined Evaluation Methods ACL 2024

PARIKSHA: A Large-Scale Investigation of Human-LLM Evaluator Agreement on Multilingual and Multi-Cultural Data EMNLP 2024

Can Large Language Models Faithfully Express Their Intrinsic Uncertainty in Words? EMNLP 2024

Large Language Models are not Fair Evaluators ACL 2024

Understanding and Mitigating Language Confusion in LLMs EMNLP 2024

UOUO: Uncontextualized Uncommon Objects for Measuring Knowledge Horizons of Vision Language Models EMNLP 2024

T-Eval: Evaluating the Tool Utilization Capability of Large Language Models Step by Step ACL 2024

Fine-Grained Detection of Solidarity for Women and Migrants in 155 Years of German Parliamentary Debates EMNLP 2024

ToxiCloakCN: Evaluating Robustness of Offensive Language Detection in Chinese with Cloaking Perturbations EMNLP 2024

Pretraining Data Detection for Large Language Models: A Divergence-based Calibration Method EMNLP 2024

LUQ: Long-text Uncertainty Quantification for LLMs EMNLP 2024

Why Don’t Prompt-Based Fairness Metrics Correlate? ACL 2024

MM-SAP: A Comprehensive Benchmark for Assessing Self-Awareness of Multimodal Large Language Models in Perception ACL 2024

Beyond Reference: Evaluating High Quality Translations Better than Human References EMNLP 2024

FOOL ME IF YOU CAN! An Adversarial Dataset to Investigate the Robustness of LMs in Word Sense Disambiguation EMNLP 2024

An Analysis of Multilingual FActScore EMNLP 2024

Prometheus 2: An Open Source Language Model Specialized in Evaluating Other Language Models EMNLP 2024

When Reasoning Meets Information Aggregation: A Case Study with Sports Narratives EMNLP 2024

Different Tastes of Entities: Investigating Human Label Variation in Named Entity Annotations EACL 2024

Order Effects in Annotation Tasks: Further Evidence of Annotation Sensitivity EACL 2024