Evaluation

1654 directly classified papers

Papers

Building a stable classifier with the inflated argmax NIPS 2024

On the Worst Prompt Performance of Large Language Models NIPS 2024

Safetywashing: Do AI Safety Benchmarks Actually Measure Safety Progress? NIPS 2024

Zipper: Addressing Degeneracy in Algorithm-Agnostic Inference NIPS 2024

Are Large Language Models Good Statisticians? NIPS 2024

Are Language Models Actually Useful for Time Series Forecasting? NIPS 2024

Reframing Spatial Reasoning Evaluation in Language Models: A Real-World Simulation Benchmark for Qualitative Reasoning IJCAI 2024

MultifacetEval: Multifaceted Evaluation to Probe LLMs in Mastering Medical Knowledge IJCAI 2024

NYU CTF Bench: A Scalable Open-Source Benchmark Dataset for Evaluating LLMs in Offensive Security NIPS 2024

When Are Two Lists Better than One?: Benefits and Harms in Joint Decision-Making AAAI 2024

PANORAMIA: Privacy Auditing of Machine Learning Models without Retraining NIPS 2024

WelQrate: Defining the Gold Standard in Small Molecule Drug Discovery Benchmarking NIPS 2024

Generative Interpretation: Toward Human-Like Evaluation for Educational Question-Answer Pair Generation EACL 2024

When is a Metaphor Actually Novel? Annotating Metaphor Novelty in the Context of Automatic Metaphor Detection EACL 2024

Are You Serious? Handling Disagreement When Annotating Conspiracy Theory Texts EACL 2024

Donkii: Characterizing and Detecting Errors in Instruction-Tuning Datasets EACL 2024

Can I trust You? LLMs as conversational agents EACL 2024

Can Large Language Models Reason About Goal-Oriented Tasks? EACL 2024

InstructEval: Towards Holistic Evaluation of Instruction-Tuned Large Language Models EACL 2024

Order Effects in Annotation Tasks: Further Evidence of Annotation Sensitivity EACL 2024

Different Tastes of Entities: Investigating Human Label Variation in Named Entity Annotations EACL 2024

WenMind: A Comprehensive Benchmark for Evaluating Large Language Models in Chinese Classical Literature and Language Arts NIPS 2024

SETLEXSEM CHALLENGE: Using Set Operations to Evaluate the Lexical and Semantic Robustness of Language Models NIPS 2024

A Careful Examination of Large Language Model Performance on Grade School Arithmetic NIPS 2024

Are More LLM Calls All You Need? Towards the Scaling Properties of Compound AI Systems NIPS 2024