Research Explorer
benchmark evaluation
1539 papers
Also known as
MT-BENCH
BDC
Co-occurring keywords
large language model (12755)
question answering (2904)
multimodal learning (4622)
language model (4573)
multimodal large language model (865)
vision-language model (2235)
visual question answering (1000)
evaluation benchmark (250)
multilingual nlp (1423)
benchmark dataset (619)
Papers
Positional Bias in Long-Document Ranking: Impact, Assessment, and Mitigation
AACL 2025
Current Semantic-change Quantification Methods Struggle with Discovery in the Wild
EMNLP 2025
A benchmark for end-to-end zero-shot biomedical relation extraction with LLMs: experiments with OpenAI models
AACL 2025
Efficient Performance Tracking: Leveraging Large Language Models for Automated Construction of Scientific Leaderboards
EMNLP 2024
ToolBeHonest: A Multi-level Hallucination Diagnostic Benchmark for Tool-Augmented Large Language Models
EMNLP 2024
SciEx: Benchmarking Large Language Models on Scientific Exams with Human Expert Grading and Automatic Grading
EMNLP 2024
Reasoning or a Semblance of it? A Diagnostic Study of Transitive Reasoning in LLMs
EMNLP 2024
QGEval: Benchmarking Multi-dimensional Evaluation for Question Generation
EMNLP 2024
PromISe: Releasing the Capabilities of LLMs with Prompt Introspective Search
COLING 2024
Towards a Danish Semantic Reasoning Benchmark - Compiled from Lexical-Semantic Resources for Assessing Selected Language Understanding Capabilities of Large Language Models
COLING 2024
Vygotsky Distance: Measure for Benchmark Task Similarity
COLING 2024
Navigating the Modern Evaluation Landscape: Considerations in Benchmarks and Frameworks for Large Language Models (LLMs)
COLING 2024
Ukrainian Visual Word Sense Disambiguation Benchmark
COLING 2024
LongBench: A Bilingual, Multitask Benchmark for Long Context Understanding
ACL 2024
Open Ko-LLM Leaderboard: Evaluating Large Language Models in Korean with Ko-H5 Benchmark
ACL 2024
OlympiadBench: A Challenging Benchmark for Promoting AGI with Olympiad-Level Bilingual Multimodal Scientific Problems
ACL 2024
M4GT-Bench: Evaluation Benchmark for Black-Box Machine-Generated Text Detection
ACL 2024
CodeScope: An Execution-based Multilingual Multitask Multidimensional Benchmark for Evaluating LLMs on Code Understanding and Generation
ACL 2024
ProxyQA: An Alternative Framework for Evaluating Long-Form Text Generation with Large Language Models
ACL 2024
M3CoT: A Novel Benchmark for Multi-Domain Multi-step Multi-modal Chain-of-Thought
ACL 2024
MM-SAP: A Comprehensive Benchmark for Assessing Self-Awareness of Multimodal Large Language Models in Perception
ACL 2024
CODIS: Benchmarking Context-dependent Visual Comprehension for Multimodal Large Language Models
ACL 2024
DOSA: A Dataset of Social Artifacts from Different Indian Geographical Subcultures
COLING 2024
Jargon: A Suite of Language Models and Evaluation Tasks for French Specialized Domains
COLING 2024
MVP: Minimal Viable Phrase for Long Text Understanding
COLING 2024