Evaluation

1654 directly classified papers

Papers

Weak-eval-Strong: Evaluating and Eliciting Lateral Thinking of LLMs with Situation Puzzles NIPS 2024

On the Reliability of Psychological Scales on Large Language Models EMNLP 2024

Replicability in Learning: Geometric Partitions and KKM-Sperner Lemma NIPS 2024

Liar, Liar, Logical Mire: A Benchmark for Suppositional Reasoning in Large Language Models EMNLP 2024

BLEnD: A Benchmark for LLMs on Everyday Knowledge in Diverse Cultures and Languages NIPS 2024

Assessing “Implicit” Retrieval Robustness of Large Language Models EMNLP 2024

Efficient Lifelong Model Evaluation in an Era of Rapid Progress NIPS 2024

kGym: A Platform and Dataset to Benchmark Large Language Models on Linux Kernel Crash Resolution NIPS 2024

Evaluating LLMs for Targeted Concept Simplification for Domain-Specific Texts EMNLP 2024

An engine not a camera: Measuring performative power of online search NIPS 2024

Conformalized Multiple Testing after Data-dependent Selection NIPS 2024

Rethinking LLM Memorization through the Lens of Adversarial Compression NIPS 2024

Questioning the Survey Responses of Large Language Models NIPS 2024

Towards Human-AI Complementarity with Prediction Sets NIPS 2024

Automating Dataset Updates Towards Reliable and Timely Evaluation of Large Language Models NIPS 2024

TaskBench: Benchmarking Large Language Models for Task Automation NIPS 2024

ECON: On the Detection and Resolution of Evidence Conflicts EMNLP 2024

PUB: A Pragmatics Understanding Benchmark for Assessing LLMs’ Pragmatics Capabilities ACL 2024

Generalization or Memorization: Data Contamination and Trustworthy Evaluation for Large Language Models ACL 2024

Beyond Probabilities: Unveiling the Misalignment in Evaluating Large Language Models ACL 2024

Measuring the Inconsistency of Large Language Models in Preferential Ranking ACL 2024

MultifacetEval: Multifaceted Evaluation to Probe LLMs in Mastering Medical Knowledge IJCAI 2024

Reframing Spatial Reasoning Evaluation in Language Models: A Real-World Simulation Benchmark for Qualitative Reasoning IJCAI 2024

ACUEval: Fine-grained Hallucination Evaluation and Correction for Abstractive Summarization ACL 2024

Data Contamination Calibration for Black-box LLMs ACL 2024