benchmark evaluation

1539 papers

Also known as

MT-BENCH, BDC

Co-occurring keywords

large language model (12755), question answering (2904), multimodal learning (4622), language model (4573), multimodal large language model (865), vision-language model (2235), visual question answering (1000), evaluation benchmark (250), multilingual nlp (1423), benchmark dataset (619)

Papers

Rethink DARTS Search Space and Renovate a New Benchmark ICML 2023

DS-1000: A Natural and Reliable Benchmark for Data Science Code Generation ICML 2023

GraphCleaner: Detecting Mislabelled Samples in Popular Graph Learning Benchmarks ICML 2023

Towards Fair and Comprehensive Comparisons for Image-Based 3D Object Detection ICCV 2023

Evaluating context-invariance in unsupervised speech representations INTERSPEECH 2023

Svarah: Evaluating English ASR Systems on Indian Accents INTERSPEECH 2023

DFEE: Interactive DataFlow Execution and Evaluation Kit AAAI 2023

Why Aren’t We NER Yet? Artifacts of ASR Errors in Named Entity Recognition in Spontaneous Speech Transcripts ACL 2023

Exploring the Capacity of Pretrained Language Models for Reasoning about Actions and Change ACL 2023

Benchmarking Large Language Model Capabilities for Conditional Generation ACL 2023

What’s the Meaning of Superhuman Performance in Today’s NLU? ACL 2023

Quantifying Train-Evaluation Overlap with Nearest Neighbors ACL 2023

Measuring Progress in Fine-grained Vision-and-Language Understanding ACL 2023

What Do NLP Researchers Believe? Results of the NLP Community Metasurvey ACL 2023

Can Language Models Be Specific? How? ACL 2023

When to Use What: An In-Depth Comparative Empirical Analysis of OpenIE Systems for Downstream Applications ACL 2023

READIN: A Chinese Multi-Task Benchmark with Realistic and Diverse Input Noises ACL 2023

Towards Reasoning in Large Language Models: A Survey ACL 2023

The Devil is in the Details: On the Pitfalls of Event Extraction Evaluation ACL 2023

ORCA: A Challenging Benchmark for Arabic Language Understanding ACL 2023

LMentry: A Language Model Benchmark of Elementary Language Tasks ACL 2023

Detecting Edit Failures In Large Language Models: An Improved Specificity Benchmark ACL 2023

Moving Beyond Downstream Task Accuracy for Information Retrieval Benchmarking ACL 2023

MoQA: Benchmarking Multi-Type Open-Domain Question Answering ACL 2023

Language models are not naysayers: an analysis of language models on negation benchmarks ACL 2023