benchmark evaluation

1539 papers

Also known as

MT-BENCH BDC

Co-occurring keywords

large language model (12755); question answering (2904); multimodal learning (4622); language model (4573); multimodal large language model (865); vision-language model (2235); visual question answering (1000); evaluation benchmark (250); multilingual nlp (1423); benchmark dataset (619)

Papers

A Picture is Worth More Than 77 Text Tokens: Evaluating CLIP-Style Models on Dense Captions CVPR 2024

The Instinctive Bias: Spurious Images lead to Illusion in MLLMs EMNLP 2024

Evaluating Computational Representations of Character: An Austen Character Similarity Benchmark EMNLP 2024

VarBench: Robust Language Model Benchmarking Through Dynamic Variable Perturbation EMNLP 2024

LEGOBench: Scientific Leaderboard Generation Benchmark EMNLP 2024

BLADE: Benchmarking Language Model Agents for Data-Driven Science EMNLP 2024

On Leakage of Code Generation Evaluation Datasets EMNLP 2024

TuringQ: Benchmarking AI Comprehension in Theory of Computation EMNLP 2024

ERVQA: A Dataset to Benchmark the Readiness of Large Vision Language Models in Hospital Environments EMNLP 2024

MAIR: A Massive Benchmark for Evaluating Instructed Retrieval EMNLP 2024

DA-Code: Agent Data Science Code Generation Benchmark for Large Language Models EMNLP 2024

On Train-Test Class Overlap and Detection for Image Retrieval CVPR 2024

From Local Concepts to Universals: Evaluating the Multicultural Understanding of Vision-Language Models EMNLP 2024

THRONE: An Object-based Hallucination Benchmark for the Free-form Generations of Large Vision-Language Models CVPR 2024

HallusionBench: An Advanced Diagnostic Suite for Entangled Language Hallucination and Visual Illusion in Large Vision-Language Models CVPR 2024

Synthesize, Diagnose, and Optimize: Towards Fine-Grained Vision-Language Understanding CVPR 2024

STORYSUMM: Evaluating Faithfulness in Story Summarization EMNLP 2024

Are Large Language Models Good Statisticians? NIPS 2024

InfiBench: Evaluating the Question-Answering Capabilities of Code Large Language Models NIPS 2024

A Call to Reflect on Evaluation Practices for Age Estimation: Comparative Analysis of the State-of-the-Art and a Unified Benchmark CVPR 2024

What Are We Measuring When We Evaluate Large Vision-Language Models? An Analysis of Latent Factors and Biases NAACL 2024

SEED-Bench: Benchmarking Multimodal Large Language Models CVPR 2024

MAFALDA: A Benchmark and Comprehensive Study of Fallacy Detection and Classification NAACL 2024

Can LLMs Solve Molecule Puzzles? A Multimodal Benchmark for Molecular Structure Elucidation NIPS 2024

Visual Riddles: a Commonsense and World Knowledge Challenge for Large Vision and Language Models NIPS 2024