benchmark evaluation

1539 papers

Also known as

MT-BENCH BDC

Co-occurring keywords

large language model (12755), question answering (2904), multimodal learning (4622), language model (4573), multimodal large language model (865), vision-language model (2235), visual question answering (1000), evaluation benchmark (250), multilingual nlp (1423), benchmark dataset (619)

Papers

Transitioning from benchmarks to a real-world case of information-seeking in Scientific Publications ACL 2023

Revisiting Cross-Lingual Summarization: A Corpus-based Study and A New Benchmark with Improved Annotation ACL 2023

Do Question Answering Modeling Improvements Hold Across Benchmarks? ACL 2023

How Predictable Are Large Language Model Capabilities? A Case Study on BIG-bench EMNLP 2023

USB: A Unified Summarization Benchmark Across Tasks and Domains EMNLP 2023

Is GPT-4 a Good Data Analyst? EMNLP 2023

SuperTweetEval: A Challenging, Unified and Heterogeneous Benchmark for Social Media NLP Research EMNLP 2023

Elo Uncovered: Robustness and Best Practices in Language Model Evaluation EMNLP 2023

Post Turing: Mapping the landscape of LLM Evaluation EMNLP 2023

Towards Explainable and Accessible AI EMNLP 2023

How hard are computer vision datasets? Calibrating dataset difficulty to viewing time NIPS 2023

OpenSTL: A Comprehensive Benchmark of Spatio-Temporal Predictive Learning NIPS 2023

What a MESS: Multi-Domain Evaluation of Zero-Shot Semantic Segmentation NIPS 2023

OpenDataVal: a Unified Benchmark for Data Valuation NIPS 2023

Lo-Hi: Practical ML Drug Discovery Benchmark NIPS 2023

CrossCodeEval: A Diverse and Multilingual Benchmark for Cross-File Code Completion NIPS 2023

Med-HALT: Medical Domain Hallucination Test for Large Language Models CONLL 2023

Theory of Mind in Large Language Models: Examining Performance of 11 State-of-the-Art models vs. Children Aged 7-10 on Advanced Tests CONLL 2023

The Validity of Evaluation Results: Assessing Concurrence Across Compositionality Benchmarks CONLL 2023

Revealing Weaknesses of Vietnamese Language Models Through Unanswerable Questions in Machine Reading Comprehension EACL 2023

On Evaluation of Document Classification with RVL-CDIP EACL 2023

Empathy Identification Systems are not Accurately Accounting for Context EACL 2023

It’s about Time: Rethinking Evaluation on Rumor Detection Benchmarks using Chronological Splits EACL 2023

Distinguishing Cause and Effect in Bivariate Structural Causal Models: A Systematic Investigation JMLR 2023

Atari-5: Distilling the Arcade Learning Environment down to Five Games ICML 2023