benchmark evaluation

1539 papers

Also known as

MT-BENCH, BDC

Co-occurring keywords

large language model (12755), question answering (2904), multimodal learning (4622), language model (4573), multimodal large language model (865), vision-language model (2235), visual question answering (1000), evaluation benchmark (250), multilingual nlp (1423), benchmark dataset (619)

Papers

NusaCrowd: Open Source Initiative for Indonesian NLP Resources ACL 2023

API-Bank: A Comprehensive Benchmark for Tool-Augmented LLMs EMNLP 2023

Evaluating Large Language Models on Controlled Generation Tasks EMNLP 2023

Can language models learn analogical reasoning? Investigating training objectives and comparisons to human performance EMNLP 2023

GPTAraEval: A Comprehensive Evaluation of ChatGPT on Arabic NLP EMNLP 2023

We’re Afraid Language Models Aren’t Modeling Ambiguity EMNLP 2023

Stop Uploading Test Data in Plain Text: Practical Strategies for Mitigating Data Contamination by Evaluation Benchmarks EMNLP 2023

Hi-ToM: A Benchmark for Evaluating Higher-Order Theory of Mind Reasoning in Large Language Models EMNLP 2023

A Comprehensive Benchmark for Neural Human Radiance Fields NIPS 2023

Benchmarking Foundation Models with Language-Model-as-an-Examiner NIPS 2023

OceanBench: The Sea Surface Height Edition NIPS 2023

REASONER: An Explainable Recommendation Dataset with Comprehensive Labeling Ground Truths NIPS 2023

Seeing is not always believing: Benchmarking Human and Model Perception of AI-Generated Images NIPS 2023

VisIT-Bench: A Dynamic Benchmark for Evaluating Instruction-Following Vision-and-Language Models NIPS 2023

LegalBench: A Collaboratively Built Benchmark for Measuring Legal Reasoning in Large Language Models NIPS 2023

MultiRobustBench: Benchmarking Robustness Against Multiple Attacks ICML 2023

Benchmarking Diverse-Modal Entity Linking with Generative Models ACL 2023

Revisiting Scene Text Recognition: A Data Perspective ICCV 2023

Benchmarking Long-tail Generalization with Likelihood Splits EACL 2023

MT-GenEval: A Counterfactual and Contextual Dataset for Evaluating Gender Accuracy in Machine Translation EMNLP 2022

ACL Tutorial Proposal: Towards Reproducible Machine Learning Research in Natural Language Processing ACL 2022

Nibbling at the Hard Core of Word Sense Disambiguation ACL 2022

SRL4E – Semantic Role Labeling for Emotions: A Unified Evaluation Framework ACL 2022

VALUE: Understanding Dialect Disparity in NLU ACL 2022

NumGLUE: A Suite of Fundamental yet Challenging Mathematical Reasoning Tasks ACL 2022