benchmark evaluation

1539 papers

Also known as

MT-BENCH, BDC

Co-occurring keywords

large language model (12755), question answering (2904), multimodal learning (4622), language model (4573), multimodal large language model (865), vision-language model (2235), visual question answering (1000), evaluation benchmark (250), multilingual nlp (1423), benchmark dataset (619)

Papers

LEXTREME: A Multi-Lingual and Multi-Task Benchmark for the Legal Domain (EMNLP 2023)

Pseudointelligence: A Unifying Lens on Language Model Evaluation (EMNLP 2023)

ZeroSCROLLS: A Zero-Shot Benchmark for Long Text Understanding (EMNLP 2023)

Zero-shot Topical Text Classification with LLMs - an Experimental Study (EMNLP 2023)

NEWTON: Are Large Language Models Capable of Physical Reasoning? (EMNLP 2023)

NLP Evaluation in trouble: On the Need to Measure LLM Data Contamination for each Benchmark (EMNLP 2023)

Battle of the Large Language Models: Dolly vs LLaMA vs Vicuna vs Guanaco vs Bard vs ChatGPT - A Text-to-SQL Parsing Comparison (EMNLP 2023)

CompleQA: Benchmarking the Impacts of Knowledge Graph Completion Methods on Question Answering (EMNLP 2023)

Analyzing Multilingual Competency of LLMs in Multi-Turn Instruction Following: A Case Study of Arabic (EMNLP 2023)

AlGhafa Evaluation Benchmark for Arabic Language Models (EMNLP 2023)

Med-HALT: Medical Domain Hallucination Test for Large Language Models (EMNLP 2023)

Evaluating Neural Language Models as Cognitive Models of Language Acquisition (EMNLP 2023)

On using distribution-based compositionality assessment to evaluate compositional generalisation in machine translation (EMNLP 2023)

A Whac-a-Mole Dilemma: Shortcuts Come in Multiples Where Mitigating One Amplifies Others (CVPR 2023)

Large Language Models Can Be Easily Distracted by Irrelevant Context (ICML 2023)

MTEB: Massive Text Embedding Benchmark (EACL 2023)

Fighting FIRe with FIRE: Assessing the Validity of Text-to-Video Retrieval Benchmarks (EACL 2023)

The UNLP 2023 Shared Task on Grammatical Error Correction for Ukrainian (EACL 2023)

CodaLab Competitions: An Open Source Platform to Organize Scientific Challenges (JMLR 2023)

HRS-Bench: Holistic, Reliable and Scalable Benchmark for Text-to-Image Models (ICCV 2023)

The OPUS-MT Dashboard – A Toolkit for a Systematic Evaluation of Open Machine Translation Models (ACL 2023)

A Systematic Study and Comprehensive Evaluation of ChatGPT on Benchmark Datasets (ACL 2023)

Challenging BIG-Bench Tasks and Whether Chain-of-Thought Can Solve Them (ACL 2023)

WikiHowQA: A Comprehensive Benchmark for Multi-Document Non-Factoid Question Answering (ACL 2023)

Rogue Scores (ACL 2023)