Evaluation

1654 directly classified papers

Papers

ECON: On the Detection and Resolution of Evidence Conflicts EMNLP 2024

Do Large Language Models Know How Much They Know? EMNLP 2024

I am a Strange Dataset: Metalinguistic Tests for Language Models ACL 2024

On the Reliability of Psychological Scales on Large Language Models EMNLP 2024

CliMedBench: A Large-Scale Chinese Benchmark for Evaluating Medical Large Language Models in Clinical Scenarios EMNLP 2024

Monoculture in Matching Markets NIPS 2024

PANORAMIA: Privacy Auditing of Machine Learning Models without Retraining NIPS 2024

An Analysis of Multilingual FActScore EMNLP 2024

NYU CTF Bench: A Scalable Open-Source Benchmark Dataset for Evaluating LLMs in Offensive Security NIPS 2024

When Reasoning Meets Information Aggregation: A Case Study with Sports Narratives EMNLP 2024

Prometheus 2: An Open Source Language Model Specialized in Evaluating Other Language Models EMNLP 2024

WelQrate: Defining the Gold Standard in Small Molecule Drug Discovery Benchmarking NIPS 2024

FOOL ME IF YOU CAN! An Adversarial Dataset to Investigate the Robustness of LMs in Word Sense Disambiguation EMNLP 2024

Beyond Reference: Evaluating High Quality Translations Better than Human References EMNLP 2024

Are Language Models Actually Useful for Time Series Forecasting? NIPS 2024

Different Tastes of Entities: Investigating Human Label Variation in Named Entity Annotations EACL 2024

Mobile-Bench: An Evaluation Benchmark for LLM-based Mobile Agents ACL 2024

An Audit on the Perspectives and Challenges of Hallucinations in NLP EMNLP 2024

Efficient LLM Comparative Assessment: A Product of Experts Framework for Pairwise Comparisons EMNLP 2024

Order Effects in Annotation Tasks: Further Evidence of Annotation Sensitivity EACL 2024

Assessing “Implicit” Retrieval Robustness of Large Language Models EMNLP 2024

LUQ: Long-text Uncertainty Quantification for LLMs EMNLP 2024

Is It Good Data for Multilingual Instruction Tuning or Just Bad Multilingual Evaluation for Large Language Models? EMNLP 2024

Pretraining Data Detection for Large Language Models: A Divergence-based Calibration Method EMNLP 2024

STOP! Benchmarking Large Language Models with Sensitivity Testing on Offensive Progressions EMNLP 2024