model evaluation

442 papers

Co-occurring keywords

large language model (12755)
benchmark evaluation (1539)
text classification (6776)
language model (4573)
natural language processing (2027)
evaluation benchmark (250)
natural language inference (1278)
multimodal learning (4622)
bias detection (419)
transfer learning (5442)

Papers

Consolidating and Developing Benchmarking Datasets for the Nepali Natural Language Understanding Tasks IJCNLP 2025

Beyond Single-Value Metrics: Evaluating and Enhancing LLM Unlearning with Cognitive Diagnosis ACL 2025

Lost in Transcription, Found in Distribution Shift: Demystifying Hallucination in Speech Foundation Models ACL 2025

Curse of Knowledge: Your Guidance and Provided Knowledge are biasing LLM Judges in Complex Evaluation EMNLP 2025

Evaluating Sensitivity Consistency of Explanations WACV 2025

JuStRank: Benchmarking LLM Judges for System Ranking ACL 2025

Benchmarking Distributional Alignment of Large Language Models NAACL 2025

Advancing Language Models through Instruction Tuning: Recent Progress and Challenges EMNLP 2025

A Rapid Test for Accuracy and Bias of Face Recognition Technology WACV 2025

Unsolvable Problem Detection: Robust Understanding Evaluation for Large Multimodal Models ACL 2025

LLMs May Perform MCQA by Selecting the Least Incorrect Option COLING 2025

The Alternative Annotator Test for LLM-as-a-Judge: How to Statistically Justify Replacing Human Annotators with LLMs ACL 2025

A Statistical and Multi-Perspective Revisiting of the Membership Inference Attack in Large Language Models ACL 2025

Language Model Probabilities are Not Calibrated in Numeric Contexts ACL 2025

Not quite Sherlock Holmes: Language model predictions do not reliably differentiate impossible from improbable events ACL 2025

BOSE: A Systematic Evaluation Method Optimized for Base Models ACL 2025

Worse than Random? An Embarrassingly Simple Probing Evaluation of Large Multimodal Models in Medical VQA ACL 2025

COVER: Context-Driven Over-Refusal Verification in LLMs ACL 2025

EasyJudge: an Easy-to-use Tool for Comprehensive Response Evaluation of LLMs COLING 2025

SSA: Semantic Contamination of LLM-Driven Fake News Detection EMNLP 2025

Acquiescence Bias in Large Language Models EMNLP 2025

Evaluating Test-Time Scaling LLMs for Legal Reasoning: OpenAI o1, DeepSeek-R1, and Beyond EMNLP 2025

Harry Potter is Still Here! Probing Knowledge Leakage in Targeted Unlearned Large Language Models EMNLP 2025

MM-CRITIC: A Holistic Evaluation of Large Multimodal Models as Multimodal Critique EMNLP 2025

Lost in Inference: Rediscovering the Role of Natural Language Inference for Large Language Models NAACL 2025