model evaluation

442 papers

Co-occurring keywords

large language model (12755), benchmark evaluation (1539), text classification (6776), language model (4573), natural language processing (2027), evaluation benchmark (250), natural language inference (1278), multimodal learning (4622), bias detection (419), transfer learning (5442)

Papers

FindTheFlaws: Annotated Errors for Detecting Flawed Reasoning and Scalable Oversight Research AAAI 2026

Unveiling the Deficiencies of Pre-trained Text-and-Layout Models in Real-world Visually-rich Document Information Extraction EACL 2026

On the Evaluation of Capability Estimation Methods for Large Language Models AAAI 2026

Confirmation Bias: A Challenge for Scalable Oversight AAAI 2026

Do Psychometric Tests Work for Large Language Models? Evaluation of Tests on Sexism, Racism, and Morality EACL 2026

Say It Another Way: Auditing LLMs with a User-Grounded Automated Paraphrasing Framework EACL 2026

Cognitive Effects and Biases in Large Language Models EACL 2026

Trove: A Flexible Toolkit for Dense Retrieval EACL 2026

Measuring Model Performance in the Presence of an Intervention AAAI 2026

Effective Strategies for Teaching Machine Learning AAAI 2026

LLMs as Span Annotators: A Comparative Study of LLMs and Humans EACL 2026

From Calculation to Adjudication: Examining LLM Judges on Mathematical Reasoning Tasks ACL 2025

RCScore: Quantifying Response Consistency in Large Language Models EMNLP 2025

Benchmarking Distributional Alignment of Large Language Models NAACL 2025

SeaExam and SeaBench: Benchmarking LLMs with Local Multilingual Questions in Southeast Asia NAACL 2025

DUTJBD at SemEval-2025 Task 3: A Range of Approaches for Predicting Hallucination Generation in Models SEMEVAL 2025

Evaluating Intermediate Reasoning of Code-Assisted Large Language Models for Mathematics ACL 2025

NOVA-63: Native Omni-lingual Versatile Assessments of 63 Disciplines EMNLP 2025

AdamMeme: Adaptively Probe the Reasoning Capacity of Multimodal Large Language Models on Harmfulness ACL 2025

Towards Comprehensive Evaluation of Open-Source Language Models: A Multi-Dimensional, User-Driven Approach ACL 2025

What is an “Abstract Reasoner”? Revisiting Experiments and Arguments about Large Language Models ACL 2025

SkillVerse: Assessing and Enhancing LLMs with Tree Evaluation ACL 2025

FlagEvalMM: A Flexible Framework for Comprehensive Multimodal Model Evaluation ACL 2025

skLEP: A Slovak General Language Understanding Benchmark ACL 2025

Towards Open Foundation Language Model and Corpus for Macedonian: A Low-Resource Language ACL 2025