benchmark evaluation

1539 papers

Also known as

MT-BENCH BDC

Co-occurring keywords

large language model (12755), question answering (2904), multimodal learning (4622), language model (4573), multimodal large language model (865), vision-language model (2235), visual question answering (1000), evaluation benchmark (250), multilingual nlp (1423), benchmark dataset (619)

Papers

AfriMTEB and AfriE5: Benchmarking and Adapting Text Embedding Models for African Languages EACL 2026

Benchmarking Temporal Reasoning and Alignment Across Chinese Dynasties EACL 2026

PerVL-Bench: Benchmarking Multimodal Personalization for Large Vision-Language Models WACV 2026

ReMedQA: Are We Done With Medical Multiple-Choice Benchmarks? EACL 2026

Beyond Blind Following: Evaluating Robustness of LLM Agents under Imperfect Guidance EACL 2026

How Robust Are Router-LLMs? Analysis of the Fragility of LLM Routing Capabilities EACL 2026

M4U: Evaluating Multilingual Understanding and Reasoning for Large Multimodal Models WACV 2026

OpenLVLM-MIA: A Controlled Benchmark Revealing the Limits of Membership Inference Attacks on Large Vision-Language Models WACV 2026

A Benchmark for Audio Reasoning Capabilities of Multimodal Large Language Models EACL 2026

Patient-Similarity Cohort Reasoning in Clinical Text-to-SQL EACL 2026

Taxation Perspectives from Large Language Models: A Case Study on Additional Tax Penalties EACL 2026

Nahw: A Comprehensive Benchmark of Arabic Grammar Understanding, Error Detection, Correction, and Explanation EACL 2026

SCENEBench: An Audio Understanding Benchmark Grounded in Assistive and Industrial Use Cases EACL 2026

Do You See Me: A Multidimensional Benchmark for Evaluating Visual Perception in Multimodal LLMs EACL 2026

Do Generative Video Models Understand Physical Principles? WACV 2026

BOP-Distrib: Revisiting 6D Pose Estimation Benchmarks for Better Evaluation under Visual Ambiguities WACV 2026

PhyEduVideo: A Benchmark for Evaluating Text-to-Video Models for Physics Education WACV 2026

HumanBench: Two Heads, No Legs, But Mostly Human, the State of Generative Capabilities in T2I Models WACV 2026

MarineEval: Assessing the Marine Intelligence of Vision-Language Models WACV 2026

T2-RAGBench: Text-and-Table Benchmark for Evaluating Retrieval-Augmented Generation EACL 2026

Cetvel: A Unified Benchmark for Evaluating Language Understanding, Generation and Cultural Capacity of LLMs for Turkish EACL 2026

Deconstructing Instruction-Following: A New Benchmark for Granular Evaluation of Large Language Model Instruction Compliance Abilities EACL 2026

When Can We Trust LLMs in Mental Health? Large-Scale Benchmarks for Reliable LLM Evaluation EACL 2026

BeDiscovER: The Benchmark of Discourse Understanding in the Era of Reasoning Language Models EACL 2026

When Benchmarks Age: Temporal Misalignment through Large Language Model Factuality Evaluation EACL 2026