Evaluation

1654 directly classified papers

Papers

Generative Interpretation: Toward Human-Like Evaluation for Educational Question-Answer Pair Generation EACL 2024

Challenging Large Language Models with New Tasks: A Study on their Adaptability and Robustness ACL 2024

Conditional and Modal Reasoning in Large Language Models EMNLP 2024

MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark NeurIPS 2024

SeeGULL Multilingual: a Dataset of Geo-Culturally Situated Stereotypes ACL 2024

What Kind of Sourcery is This? Evaluating GPT-4’s Performance on Linking Scientific Fact to Citations EMNLP 2024

ArabicMMLU: Assessing Massive Multitask Language Understanding in Arabic ACL 2024

CritiqueLLM: Towards an Informative Critique Generation Model for Evaluation of Large Language Model Generation ACL 2024

Fine-Grained Detection of Solidarity for Women and Migrants in 155 Years of German Parliamentary Debates EMNLP 2024

Quantifying Contamination in Evaluating Code Generation Capabilities of Language Models ACL 2024

Understanding the Effects of Noise in Text-to-SQL: An Examination of the BIRD-Bench Benchmark ACL 2024

Intuitive or Dependent? Investigating LLMs’ Behavior Style to Conflicting Prompts ACL 2024

UOUO: Uncontextualized Uncommon Objects for Measuring Knowledge Horizons of Vision Language Models EMNLP 2024

Language Models Do Hard Arithmetic Tasks Easily and Hardly Do Easy Arithmetic Tasks ACL 2024

Understanding and Mitigating Language Confusion in LLMs EMNLP 2024

FOFO: A Benchmark to Evaluate LLMs’ Format-Following Capability ACL 2024

BotEval: Facilitating Interactive Human Evaluation ACL 2024

UltraEval: A Lightweight Platform for Flexible and Comprehensive Evaluation for LLMs ACL 2024

LJPCheck: Functional Tests for Legal Judgment Prediction ACL 2024

MVP-Bench: Can Large Vision-Language Models Conduct Multi-level Visual Perception Like Humans? EMNLP 2024

PARIKSHA: A Large-Scale Investigation of Human-LLM Evaluator Agreement on Multilingual and Multi-Cultural Data EMNLP 2024

EASSE-DE & EASSE-multi: Easier Automatic Sentence Simplification Evaluation for German & Multiple Languages EMNLP 2024

Analyzing LLM Behavior in Dialogue Summarization: Unveiling Circumstantial Hallucination Trends ACL 2024

NaturalCodeBench: Examining Coding Performance Mismatch on HumanEval and Natural User Queries ACL 2024

How Hard is this Test Set? NLI Characterization by Exploiting Training Dynamics EMNLP 2024