benchmark evaluation

1539 papers

Also known as

MT-BENCH, BDC

Co-occurring keywords

large language model (12755)
question answering (2904)
multimodal learning (4622)
language model (4573)
multimodal large language model (865)
vision-language model (2235)
visual question answering (1000)
evaluation benchmark (250)
multilingual nlp (1423)
benchmark dataset (619)

Papers

KazakhOCR: A Synthetic Benchmark for Evaluating Multimodal Models in Low-Resource Kazakh Script OCR EACL 2026

TeluguEval: A Comprehensive Benchmark for Evaluating LLM Capabilities in Telugu EACL 2026

Vinclat: Evaluating Reasoning, Cognition and Culture in One Game EACL 2026

TurkBench: A Benchmark for Evaluating Turkish Large Language Models EACL 2026

OCRTurk: A Comprehensive OCR Benchmark for Turkish EACL 2026

Do Generative Video Models Understand Physical Principles? WACV 2026

BOP-Distrib: Revisiting 6D Pose Estimation Benchmarks for Better Evaluation under Visual Ambiguities WACV 2026

PhyEduVideo: A Benchmark for Evaluating Text-to-Video Models for Physics Education WACV 2026

HumanBench: Two Heads, No Legs, But Mostly Human, the State of Generative Capabilities in T2I Models WACV 2026

MarineEval: Assessing the Marine Intelligence of Vision-Language Models WACV 2026

T2-RAGBench: Text-and-Table Benchmark for Evaluating Retrieval-Augmented Generation EACL 2026

Cetvel: A Unified Benchmark for Evaluating Language Understanding, Generation and Cultural Capacity of LLMs for Turkish EACL 2026

Deconstructing Instruction-Following: A New Benchmark for Granular Evaluation of Large Language Model Instruction Compliance Abilities EACL 2026

When Can We Trust LLMs in Mental Health? Large-Scale Benchmarks for Reliable LLM Evaluation EACL 2026

BeDiscovER: The Benchmark of Discourse Understanding in the Era of Reasoning Language Models EACL 2026

Nahw: A Comprehensive Benchmark of Arabic Grammar Understanding, Error Detection, Correction, and Explanation EACL 2026

Beyond Blind Following: Evaluating Robustness of LLM Agents under Imperfect Guidance EACL 2026

How Robust Are Router-LLMs? Analysis of the Fragility of LLM Routing Capabilities EACL 2026

Benchmarking Temporal Reasoning and Alignment Across Chinese Dynasties EACL 2026

Do Multi-Agents Solve Better Than Single? Evaluating Agentic Frameworks for Diagram-Grounded Geometry Problem Solving and Reasoning EACL 2026

A Benchmark and Evaluation of Automated Language of Study Extraction from Computational Linguistics Publications EACL 2026

Garbage In, Reasoning Out? Why Benchmark Scores are Unreliable and What to Do About It EACL 2026

WebNovelBench: Placing LLM Novelists on the Web Novel Distribution EACL 2026

KGHaluBench: A Knowledge Graph-Based Hallucination Benchmark for Evaluating the Breadth and Depth of LLM Knowledge EACL 2026

The Art of Saying "Maybe": A Conformal Lens for Uncertainty Benchmarking in VLMs EACL 2026