benchmark evaluation

1539 papers

Also known as

MT-BENCH; BDC

Co-occurring keywords

large language model (12755); question answering (2904); multimodal learning (4622); language model (4573); multimodal large language model (865); vision-language model (2235); visual question answering (1000); evaluation benchmark (250); multilingual nlp (1423); benchmark dataset (619)

Papers

Fundamental Capabilities of Large Language Models and their Applications in Domain Scenarios: A Survey ACL 2024

II-Bench: An Image Implication Understanding Benchmark for Multimodal Large Language Models NIPS 2024

LHMKE: A Large-scale Holistic Multi-subject Knowledge Evaluation Benchmark for Chinese Large Language Models COLING 2024

AlignBench: Benchmarking Chinese Alignment of Large Language Models ACL 2024

Understanding the Effects of Noise in Text-to-SQL: An Examination of the BIRD-Bench Benchmark ACL 2024

Benchmarking Retrieval-Augmented Generation for Medicine ACL 2024

GLBench: A Comprehensive Benchmark for Graph with Large Language Models NIPS 2024

An Examination of the Compositionality of Large Generative Vision-Language Models NAACL 2024

Toward Informal Language Processing: Knowledge of Slang in Large Language Models NAACL 2024

BUFFET: Benchmarking Large Language Models for Few-shot Cross-lingual Transfer NAACL 2024

Understanding the Capabilities and Limitations of Large Language Models for Cultural Commonsense NAACL 2024

SuperGLEBer: German Language Understanding Evaluation Benchmark NAACL 2024

Investigating Data Contamination in Modern Benchmarks for Large Language Models NAACL 2024

InstructEval: Systematic Evaluation of Instruction Selection Methods NAACL 2024

StudentEval: A Benchmark of Student-Written Prompts for Large Language Models of Code ACL 2024

SciEval: A Multi-Level Large Language Model Evaluation Benchmark for Scientific Research AAAI 2024

CLongEval: A Chinese Benchmark for Evaluating Long-Context Large Language Models EMNLP 2024

How to Evaluate the Generalization of Detection? A Benchmark for Comprehensive Open-Vocabulary Detection AAAI 2024

LatestEval: Addressing Data Contamination in Language Model Evaluation through Dynamic and Time-Sensitive Test Construction AAAI 2024

Towards Reproducible, Automated, and Scalable Anomaly Detection AAAI 2024

Benchmarking Large Language Models on CFLUE - A Chinese Financial Language Understanding Evaluation Dataset ACL 2024

VariErr NLI: Separating Annotation Error from Human Label Variation ACL 2024

Benchmarking and Improving Long-Text Translation with Large Language Models ACL 2024

Reasoning Paths with Reference Objects Elicit Quantitative Spatial Reasoning in Large Vision-Language Models EMNLP 2024

The Devil is in the Fine-Grained Details: Evaluating Open-Vocabulary Object Detectors for Fine-Grained Understanding CVPR 2024