Research Explorer
Papers
Trends
Conferences
Explore
Authors
Topics
Keywords
model evaluation
442 papers
Co-occurring keywords
large language model (12755)
benchmark evaluation (1539)
text classification (6776)
language model (4573)
natural language processing (2027)
evaluation benchmark (250)
natural language inference (1278)
multimodal learning (4622)
bias detection (419)
transfer learning (5442)
Papers
Lost in Transcription, Found in Distribution Shift: Demystifying Hallucination in Speech Foundation Models
ACL 2025
CAST: Cross-modal Alignment Similarity Test for Vision Language Models
COLING 2025
Benchmarking Distributional Alignment of Large Language Models
NAACL 2025
Maximizing Signal in Human-Model Preference Alignment
AAAI 2025
A Statistical and Multi-Perspective Revisiting of the Membership Inference Attack in Large Language Models
ACL 2025
Style Over Substance: Evaluation Biases for Large Language Models
COLING 2025
JuStRank: Benchmarking LLM Judges for System Ranking
ACL 2025
LLMsPark: A Benchmark for Evaluating Large Language Models in Strategic Gaming Contexts
EMNLP 2025
Unsolvable Problem Detection: Robust Understanding Evaluation for Large Multimodal Models
ACL 2025
The Alternative Annotator Test for LLM-as-a-Judge: How to Statistically Justify Replacing Human Annotators with LLMs
ACL 2025
Evaluating Sensitivity Consistency of Explanations
WACV 2025
RCScore: Quantifying Response Consistency in Large Language Models
EMNLP 2025
BOSE: A Systematic Evaluation Method Optimized for Base Models
ACL 2025
Worse than Random? An Embarrassingly Simple Probing Evaluation of Large Multimodal Models in Medical VQA
ACL 2025
Adaptively profiling models with task elicitation
EMNLP 2025
SELF-[IN]CORRECT: LLMs Struggle with Discriminating Self-Generated Responses
AAAI 2025
Investigating How Pre-training Data Leakage Affects Models’ Reproduction and Detection Capabilities
EMNLP 2025
Transitive self-consistency evaluation of NLI models without gold labels
EMNLP 2025
Contamination Budget: Trade-offs Between Breadth, Depth and Difficulty
IJCAI 2025
SYNC: A Synthetic Long-Context Understanding Benchmark for Controlled Comparisons of Model Capabilities
EMNLP 2025
OpenNER 1.0: Standardized Open-Access Named Entity Recognition Datasets in 50+ Languages
EMNLP 2025
PCRI: Measuring Context Robustness in Multimodal Models for Enterprise Applications
EMNLP 2025
Deploying Tiny LVLM Judges for Real-World Evaluation of Chart Models: Lessons Learned and Best Practices
EMNLP 2025
Adversarial Robustness of Discriminative Self-Supervised Learning in Vision
ICCV 2025
DefVerify: Do Hate Speech Models Reflect Their Dataset’s Definition?
COLING 2025