benchmark evaluation

1539 papers

Also known as

MT-BENCH, BDC

Co-occurring keywords

large language model (12755), question answering (2904), multimodal learning (4622), language model (4573), multimodal large language model (865), vision-language model (2235), visual question answering (1000), evaluation benchmark (250), multilingual nlp (1423), benchmark dataset (619)

Papers

Unveiling Cultural Blind Spots: Analyzing the Limitations of mLLMs in Procedural Text Comprehension ACL 2025

ELAB: Extensive LLM Alignment Benchmark in Persian Language ACL 2025

HuGME: A benchmark system for evaluating Hungarian generative LLMs ACL 2025

LLMs can be easily Confused by Instructional Distractions ACL 2025

TUMLU: A Unified and Native Language Understanding Benchmark for Turkic Languages ACL 2025

SECQUE: A Benchmark for Evaluating Real-World Financial Analysis Capabilities ACL 2025

KazBench-KK: A Cultural-Knowledge Benchmark for Kazakh ACL 2025

Chinese SimpleQA: A Chinese Factuality Evaluation for Large Language Models ACL 2025

HKCanto-Eval: A Benchmark for Evaluating Cantonese Language Understanding and Cultural Comprehension in LLMs ACL 2025

ClimateEval: A Comprehensive Benchmark for NLP Tasks Related to Climate Change ACL 2025

ELBA-Bench: An Efficient Learning Backdoor Attacks Benchmark for Large Language Models ACL 2025

Benchmarking the Performance of Pre-trained LLMs across Urdu NLP Tasks COLING 2025

“What do you call a dog that is incontrovertibly true? Dogma”: Testing LLM Generalization through Humor ACL 2025

Questioning Our Questions: How Well Do Medical QA Benchmarks Evaluate Clinical Capabilities of Language Models? ACL 2025

MMRefine: Unveiling the Obstacles to Robust Refinement in Multimodal Large Language Models ACL 2025

Do not Abstain! Identify and Solve the Uncertainty ACL 2025

DICE-BENCH: Evaluating the Tool-Use Capabilities of Large Language Models in Multi-Round, Multi-Party Dialogues ACL 2025

Do Vision-Language Models Have Internal World Models? Towards an Atomic Evaluation ACL 2025

M2RC-EVAL: Massively Multilingual Repository-level Code Completion Evaluation ACL 2025

Beyond Visual Understanding: Introducing PARROT-360V for Vision Language Model Benchmarking COLING 2025

EXCGEC: A Benchmark for Edit-Wise Explainable Chinese Grammatical Error Correction AAAI 2025

KazMMLU: Evaluating Language Models on Kazakh, Russian, and Regional Knowledge of Kazakhstan ACL 2025

How does Misinformation Affect Large Language Model Behaviors and Preferences? ACL 2025

Information Density Principle for MLLM Benchmarks ICCV 2025

Are We Done with MMLU? NAACL 2025