model evaluation

442 papers

Co-occurring keywords

large language model (12755); benchmark evaluation (1539); text classification (6776); language model (4573); natural language processing (2027); evaluation benchmark (250); natural language inference (1278); multimodal learning (4622); bias detection (419); transfer learning (5442)

Papers

Lost in Transcription, Found in Distribution Shift: Demystifying Hallucination in Speech Foundation Models ACL 2025

CAST: Cross-modal Alignment Similarity Test for Vision Language Models COLING 2025

Benchmarking Distributional Alignment of Large Language Models NAACL 2025

Maximizing Signal in Human-Model Preference Alignment AAAI 2025

A Statistical and Multi-Perspective Revisiting of the Membership Inference Attack in Large Language Models ACL 2025

Style Over Substance: Evaluation Biases for Large Language Models COLING 2025

JuStRank: Benchmarking LLM Judges for System Ranking ACL 2025

LLMsPark: A Benchmark for Evaluating Large Language Models in Strategic Gaming Contexts EMNLP 2025

Unsolvable Problem Detection: Robust Understanding Evaluation for Large Multimodal Models ACL 2025

The Alternative Annotator Test for LLM-as-a-Judge: How to Statistically Justify Replacing Human Annotators with LLMs ACL 2025

Evaluating Sensitivity Consistency of Explanations WACV 2025

RCScore: Quantifying Response Consistency in Large Language Models EMNLP 2025

BOSE: A Systematic Evaluation Method Optimized for Base Models ACL 2025

Worse than Random? An Embarrassingly Simple Probing Evaluation of Large Multimodal Models in Medical VQA ACL 2025

Adaptively profiling models with task elicitation EMNLP 2025

SELF-[IN]CORRECT: LLMs Struggle with Discriminating Self-Generated Responses AAAI 2025

Investigating How Pre-training Data Leakage Affects Models’ Reproduction and Detection Capabilities EMNLP 2025

Transitive self-consistency evaluation of NLI models without gold labels EMNLP 2025

Contamination Budget: Trade-offs Between Breadth, Depth and Difficulty IJCAI 2025

SYNC: A Synthetic Long-Context Understanding Benchmark for Controlled Comparisons of Model Capabilities EMNLP 2025

OpenNER 1.0: Standardized Open-Access Named Entity Recognition Datasets in 50+ Languages EMNLP 2025

PCRI: Measuring Context Robustness in Multimodal Models for Enterprise Applications EMNLP 2025

Deploying Tiny LVLM Judges for Real-World Evaluation of Chart Models: Lessons Learned and Best Practices EMNLP 2025

Adversarial Robustness of Discriminative Self-Supervised Learning in Vision ICCV 2025

DefVerify: Do Hate Speech Models Reflect Their Dataset’s Definition? COLING 2025