benchmark evaluation

1539 papers

Also known as

MT-BENCH, BDC

Co-occurring keywords

large language model (12755)
question answering (2904)
multimodal learning (4622)
language model (4573)
multimodal large language model (865)
vision-language model (2235)
visual question answering (1000)
evaluation benchmark (250)
multilingual nlp (1423)
benchmark dataset (619)

Papers

Positional Bias in Long-Document Ranking: Impact, Assessment, and Mitigation (AACL 2025)

Current Semantic-change Quantification Methods Struggle with Discovery in the Wild (EMNLP 2025)

A benchmark for end-to-end zero-shot biomedical relation extraction with LLMs: experiments with OpenAI models (AACL 2025)

Efficient Performance Tracking: Leveraging Large Language Models for Automated Construction of Scientific Leaderboards (EMNLP 2024)

ToolBeHonest: A Multi-level Hallucination Diagnostic Benchmark for Tool-Augmented Large Language Models (EMNLP 2024)

SciEx: Benchmarking Large Language Models on Scientific Exams with Human Expert Grading and Automatic Grading (EMNLP 2024)

Reasoning or a Semblance of it? A Diagnostic Study of Transitive Reasoning in LLMs (EMNLP 2024)

QGEval: Benchmarking Multi-dimensional Evaluation for Question Generation (EMNLP 2024)

PromISe: Releasing the Capabilities of LLMs with Prompt Introspective Search (COLING 2024)

Towards a Danish Semantic Reasoning Benchmark - Compiled from Lexical-Semantic Resources for Assessing Selected Language Understanding Capabilities of Large Language Models (COLING 2024)

Vygotsky Distance: Measure for Benchmark Task Similarity (COLING 2024)

Navigating the Modern Evaluation Landscape: Considerations in Benchmarks and Frameworks for Large Language Models (LLMs) (COLING 2024)

Ukrainian Visual Word Sense Disambiguation Benchmark (COLING 2024)

LongBench: A Bilingual, Multitask Benchmark for Long Context Understanding (ACL 2024)

Open Ko-LLM Leaderboard: Evaluating Large Language Models in Korean with Ko-H5 Benchmark (ACL 2024)

OlympiadBench: A Challenging Benchmark for Promoting AGI with Olympiad-Level Bilingual Multimodal Scientific Problems (ACL 2024)

M4GT-Bench: Evaluation Benchmark for Black-Box Machine-Generated Text Detection (ACL 2024)

CodeScope: An Execution-based Multilingual Multitask Multidimensional Benchmark for Evaluating LLMs on Code Understanding and Generation (ACL 2024)

ProxyQA: An Alternative Framework for Evaluating Long-Form Text Generation with Large Language Models (ACL 2024)

M3CoT: A Novel Benchmark for Multi-Domain Multi-step Multi-modal Chain-of-Thought (ACL 2024)

MM-SAP: A Comprehensive Benchmark for Assessing Self-Awareness of Multimodal Large Language Models in Perception (ACL 2024)

CODIS: Benchmarking Context-dependent Visual Comprehension for Multimodal Large Language Models (ACL 2024)

DOSA: A Dataset of Social Artifacts from Different Indian Geographical Subcultures (COLING 2024)

Jargon: A Suite of Language Models and Evaluation Tasks for French Specialized Domains (COLING 2024)

MVP: Minimal Viable Phrase for Long Text Understanding (COLING 2024)