benchmark evaluation

1539 papers

Also known as

MT-BENCH BDC

Co-occurring keywords

large language model (12755); question answering (2904); multimodal learning (4622); language model (4573); multimodal large language model (865); vision-language model (2235); visual question answering (1000); evaluation benchmark (250); multilingual nlp (1423); benchmark dataset (619)

Papers

„Mann“ is to “Donna” as「国王」is to « Reine » Adapting the Analogy Task for Multilingual and Contextual Embeddings ACL 2023

True Detective: A Deep Abductive Reasoning Benchmark Undoable for GPT-3 and Challenging for GPT-4 ACL 2023

ALCUNA: Large Language Models Meet New Knowledge EMNLP 2023

Evaluating Cross-Domain Text-to-SQL Models and Benchmarks EMNLP 2023

The Skipped Beat: A Study of Sociopragmatic Understanding in LLMs for 64 Languages EMNLP 2023

Do Language Models Have a Common Sense regarding Time? Revisiting Temporal Commonsense Reasoning in the Era of Large Language Models EMNLP 2023

An Investigation of LLMs’ Inefficacy in Understanding Converse Relations EMNLP 2023

EpiK-Eval: Evaluation for Language Models as Epistemic Models EMNLP 2023

Large Language Models Only Pass Primary School Exams in Indonesia: A Comprehensive Test on IndoMMLU EMNLP 2023

Can We Edit Multimodal Large Language Models? EMNLP 2023

FANToM: A Benchmark for Stress-testing Machine Theory of Mind in Interactions EMNLP 2023

CRAB: Assessing the Strength of Causal Relationships Between Real-world Events EMNLP 2023

CLEVA: Chinese Language Models EVAluation Platform EMNLP 2023

ZhuJiu: A Multi-dimensional, Multi-faceted Chinese Benchmark for Large Language Models EMNLP 2023

Are ChatGPT and GPT-4 General-Purpose Solvers for Financial Text Analytics? A Study on Several Typical Tasks EMNLP 2023

Towards A Holistic Landscape of Situated Theory of Mind in Large Language Models EMNLP 2023

XTREME-UP: A User-Centric Scarce-Data Benchmark for Under-Represented Languages EMNLP 2023

Toward Stronger Textual Attack Detectors EMNLP 2023

SciRepEval: A Multi-Format Benchmark for Scientific Document Representations EMNLP 2023

The Validity of Evaluation Results: Assessing Concurrence Across Compositionality Benchmarks EMNLP 2023

90% F1 Score in Relation Triple Extraction: Is it Real? EMNLP 2023

A Fair and In-Depth Evaluation of Existing End-to-End Entity Linking Systems EMNLP 2023

On the Challenges of Using Black-Box APIs for Toxicity Evaluation in Research EMNLP 2023

CREPE: Can Vision-Language Foundation Models Reason Compositionally? CVPR 2023

GLEMOS: Benchmark for Instantaneous Graph Learning Model Selection NeurIPS 2023