benchmark evaluation

1539 papers

Also known as

MT-BENCH, BDC

Co-occurring keywords

large language model (12755); question answering (2904); multimodal learning (4622); language model (4573); multimodal large language model (865); vision-language model (2235); visual question answering (1000); evaluation benchmark (250); multilingual nlp (1423); benchmark dataset (619)

Papers

PTEB: Towards Robust Text Embedding Evaluation via Stochastic Paraphrasing at Evaluation Time with LLMs EACL 2026

Sudanese-Flores: Extending FLORES+ to Sudanese Arabic Dialect EACL 2026

MicroEvoEval: A Systematic Evaluation Framework for Image-Based Microstructure Evolution Prediction AAAI 2026

MULTIBENCH++: A Unified and Comprehensive Multimodal Fusion Benchmarking Across Specialized Domains AAAI 2026

Complex Mathematical Expression Recognition: Benchmark, Large-Scale Dataset and Strong Baseline AAAI 2026

UQ-Bench: A Benchmark for Evaluating Multimodal LLMs on Underwater Image Quality Assessment AAAI 2026

TIME: Temporal-Sensitive Multi-Dimensional Instruction Tuning and Robust Benchmarking for Video-LLMs AAAI 2026

VP-Bench: A Comprehensive Benchmark for Visual Prompting in Multimodal Large Language Models AAAI 2026

LLMTM: Benchmarking and Optimizing LLMs for Temporal Motif Analysis in Dynamic Graphs AAAI 2026

Do Large Language Models Reason About Uncertainty Like Humans? A Benchmark on Hurricane Forecast Visualization Comprehension AAAI 2026

Socrates or Smartypants: Testing Logic Reasoning Capabilities of Large Language Models with Logic Programming-Based Test Oracles AAAI 2026

SpatialLogic-Bench: A Diagnostic Benchmark for Task-Oriented Spatiotemporal Reasoning AAAI 2026

Do Generative Video Models Understand Physical Principles? WACV 2026

BOP-Distrib: Revisiting 6D Pose Estimation Benchmarks for Better Evaluation under Visual Ambiguities WACV 2026

PhyEduVideo: A Benchmark for Evaluating Text-to-Video Models for Physics Education WACV 2026

HumanBench: Two Heads, No Legs, But Mostly Human, the State of Generative Capabilities in T2I Models WACV 2026

MarineEval: Assessing the Marine Intelligence of Vision-Language Models WACV 2026

T2-RAGBench: Text-and-Table Benchmark for Evaluating Retrieval-Augmented Generation EACL 2026

Cetvel: A Unified Benchmark for Evaluating Language Understanding, Generation and Cultural Capacity of LLMs for Turkish EACL 2026

Deconstructing Instruction-Following: A New Benchmark for Granular Evaluation of Large Language Model Instruction Compliance Abilities EACL 2026

When Can We Trust LLMs in Mental Health? Large-Scale Benchmarks for Reliable LLM Evaluation EACL 2026

BeDiscovER: The Benchmark of Discourse Understanding in the Era of Reasoning Language Models EACL 2026

Nahw: A Comprehensive Benchmark of Arabic Grammar Understanding, Error Detection, Correction, and Explanation EACL 2026

Beyond Blind Following: Evaluating Robustness of LLM Agents under Imperfect Guidance EACL 2026

ConInstruct: Evaluating Large Language Models on Conflict Detection and Resolution in Instructions AAAI 2026