Research Explorer
benchmark evaluation
1539 papers
Also known as
MT-BENCH
BDC
Co-occurring keywords
large language model (12755)
question answering (2904)
multimodal learning (4622)
language model (4573)
multimodal large language model (865)
vision-language model (2235)
visual question answering (1000)
evaluation benchmark (250)
multilingual nlp (1423)
benchmark dataset (619)
Papers
Rethink DARTS Search Space and Renovate a New Benchmark
ICML 2023
DS-1000: A Natural and Reliable Benchmark for Data Science Code Generation
ICML 2023
GraphCleaner: Detecting Mislabelled Samples in Popular Graph Learning Benchmarks
ICML 2023
Towards Fair and Comprehensive Comparisons for Image-Based 3D Object Detection
ICCV 2023
Evaluating context-invariance in unsupervised speech representations
INTERSPEECH 2023
Svarah: Evaluating English ASR Systems on Indian Accents
INTERSPEECH 2023
DFEE: Interactive DataFlow Execution and Evaluation Kit
AAAI 2023
Why Aren’t We NER Yet? Artifacts of ASR Errors in Named Entity Recognition in Spontaneous Speech Transcripts
ACL 2023
Exploring the Capacity of Pretrained Language Models for Reasoning about Actions and Change
ACL 2023
Benchmarking Large Language Model Capabilities for Conditional Generation
ACL 2023
What’s the Meaning of Superhuman Performance in Today’s NLU?
ACL 2023
Quantifying Train-Evaluation Overlap with Nearest Neighbors
ACL 2023
Measuring Progress in Fine-grained Vision-and-Language Understanding
ACL 2023
What Do NLP Researchers Believe? Results of the NLP Community Metasurvey
ACL 2023
Can Language Models Be Specific? How?
ACL 2023
When to Use What: An In-Depth Comparative Empirical Analysis of OpenIE Systems for Downstream Applications
ACL 2023
READIN: A Chinese Multi-Task Benchmark with Realistic and Diverse Input Noises
ACL 2023
Towards Reasoning in Large Language Models: A Survey
ACL 2023
The Devil is in the Details: On the Pitfalls of Event Extraction Evaluation
ACL 2023
ORCA: A Challenging Benchmark for Arabic Language Understanding
ACL 2023
LMentry: A Language Model Benchmark of Elementary Language Tasks
ACL 2023
Detecting Edit Failures In Large Language Models: An Improved Specificity Benchmark
ACL 2023
Moving Beyond Downstream Task Accuracy for Information Retrieval Benchmarking
ACL 2023
MoQA: Benchmarking Multi-Type Open-Domain Question Answering
ACL 2023
Language models are not naysayers: an analysis of language models on negation benchmarks
ACL 2023