benchmark evaluation

1539 papers

Also known as

MT-BENCH, BDC

Co-occurring keywords

large language model (12755), question answering (2904), multimodal learning (4622), language model (4573), multimodal large language model (865), vision-language model (2235), visual question answering (1000), evaluation benchmark (250), multilingual nlp (1423), benchmark dataset (619)

Papers

Multi-Object Hallucination in Vision Language Models NIPS 2024

Towards Injecting Medical Visual Knowledge into Multimodal LLMs at Scale EMNLP 2024

The Factuality Tax of Diversity-Intervened Text-to-Image Generation: Benchmark and Fact-Augmented Intervention EMNLP 2024

What Are the Odds? Language Models Are Capable of Probabilistic Reasoning EMNLP 2024

DF40: Toward Next-Generation Deepfake Detection NIPS 2024

V-PETL Bench: A Unified Visual Parameter-Efficient Transfer Learning Benchmark NIPS 2024

WaterBench: Towards Holistic Evaluation of Watermarks for Large Language Models ACL 2024

Probing Language Models for Pre-training Data Detection ACL 2024

Dr.Academy: A Benchmark for Evaluating Questioning Capability in Education for Large Language Models ACL 2024

TAPVid-3D: A Benchmark for Tracking Any Point in 3D NIPS 2024

THE COLOSSEUM: A Benchmark for Evaluating Generalization for Robotic Manipulation RSS 2024

Demonstrating HumanTHOR: A Simulation Platform and Benchmark for Human-Robot Collaboration in a Shared Workspace RSS 2024

NPHardEval: Dynamic Benchmark on Reasoning Ability of Large Language Models via Complexity Classes ACL 2024

Translation Errors Significantly Impact Low-Resource Languages in Cross-Lingual Learning EACL 2024

LLMeBench: A Flexible Framework for Accelerating LLMs Benchmarking EACL 2024

FollowBench: A Multi-level Fine-grained Constraints Following Benchmark for Large Language Models ACL 2024

Rainbow - A Benchmark for Systematic Testing of How Sensitive Visio-Linguistic Models are to Color Naming EACL 2024

Head-to-Tail: How Knowledgeable are Large Language Models (LLMs)? A.K.A. Will LLMs Replace Knowledge Graphs? NAACL 2024

MedJourney: Benchmark and Evaluation of Large Language Models over Patient Clinical Journey NIPS 2024

Benchmarking Large Language Models for Persian: A Preliminary Study Focusing on ChatGPT COLING 2024

SpaRC and SpaRP: Spatial Reasoning Characterization and Path Generation for Understanding Spatial Reasoning Capability of Large Language Models ACL 2024

Marathon: A Race Through the Realm of Long Context with Large Language Models ACL 2024

Can Large Language Models Understand Context? EACL 2024

AI-Olympics: Exploring the Generalization of Agents through Open Competitions IJCAI 2024

The Scandinavian Embedding Benchmarks: Comprehensive Assessment of Multilingual and Monolingual Text Embedding NIPS 2024