Evaluation

1654 directly classified papers

Papers

Multi-Dimensional Machine Translation Evaluation: Model Evaluation and Resource for Korean COLING 2024

MoZIP: A Multilingual Benchmark to Evaluate Large Language Models in Intellectual Property COLING 2024

LongDocFACTScore: Evaluating the Factuality of Long Document Abstractive Summarisation COLING 2024

KoDialogBench: Evaluating Conversational Understanding of Language Models with Korean Dialogue Benchmark COLING 2024

JCoLA: Japanese Corpus of Linguistic Acceptability COLING 2024

Is Summary Useful or Not? An Extrinsic Human Evaluation of Text Summaries on Downstream Tasks COLING 2024

Is LLM a Reliable Reviewer? A Comprehensive Evaluation of LLM on Automatic Paper Reviewing Tasks COLING 2024

Large Language Models Fall Short: Understanding Complex Relationships in Detective Narratives ACL 2024

The State of Relation Extraction Data Quality: Is Bigger Always Better? ACL 2024

“My Answer is C”: First-Token Probabilities Do Not Match Text Answers in Instruction-Tuned Language Models ACL 2024

NaturalCodeBench: Examining Coding Performance Mismatch on HumanEval and Natural User Queries ACL 2024

MathBench: Evaluating the Theory and Application Proficiency of LLMs with a Hierarchical Mathematics Benchmark ACL 2024

Cocktail: A Comprehensive Information Retrieval Benchmark with LLM-Generated Documents Integration ACL 2024

Challenging Large Language Models with New Tasks: A Study on their Adaptability and Robustness ACL 2024

StudentEval: A Benchmark of Student-Written Prompts for Large Language Models of Code ACL 2024

Rationales for Answers to Simple Math Word Problems Confuse Large Language Models ACL 2024

Speech vs. Transcript: Does It Matter for Human Annotators in Speech Summarization? ACL 2024

Understanding the Effects of Noise in Text-to-SQL: An Examination of the BIRD-Bench Benchmark ACL 2024

Analyzing LLM Behavior in Dialogue Summarization: Unveiling Circumstantial Hallucination Trends ACL 2024

SeeGULL Multilingual: a Dataset of Geo-Culturally Situated Stereotypes ACL 2024

Uncertainty Quantification for Data-Driven Change-Point Learning via Cross-Validation AAAI 2024

Improving GNN Calibration with Discriminative Ability: Insights and Strategies AAAI 2024

A General Model for Aggregating Annotations Across Simple, Complex, and Multi-object Annotation Tasks (Abstract Reprint) AAAI 2024

ArabicMMLU: Assessing Massive Multitask Language Understanding in Arabic ACL 2024

An Empirical Investigation Into Benchmarking Model Multiplicity for Trustworthy Machine Learning: A Case Study on Image Classification WACV 2024