benchmark evaluation

1539 papers

Also known as

MT-BENCH, BDC

Co-occurring keywords

large language model (12755), question answering (2904), multimodal learning (4622), language model (4573), multimodal large language model (865), vision-language model (2235), visual question answering (1000), evaluation benchmark (250), multilingual nlp (1423), benchmark dataset (619)

Papers

MIBench: Evaluating Multimodal Large Language Models over Multiple Images EMNLP 2024

OpenT2T: An Open-Source Toolkit for Table-to-Text Generation EMNLP 2024

RAGLAB: A Modular and Research-Oriented Unified Framework for Retrieval-Augmented Generation EMNLP 2024

OpenToM: A Comprehensive Benchmark for Evaluating Theory-of-Mind Reasoning Capabilities of Large Language Models ACL 2024

SpreadsheetBench: Towards Challenging Real World Spreadsheet Manipulation NeurIPS 2024

ConceptMix: A Compositional Image Generation Benchmark with Controllable Difficulty NeurIPS 2024

Artifacts or Abduction: How Do LLMs Answer Multiple-Choice Questions Without the Question? ACL 2024

Examining the robustness of LLM evaluation to the distributional assumptions of benchmarks ACL 2024

CLAMBER: A Benchmark of Identifying and Clarifying Ambiguous Information Needs in Large Language Models ACL 2024

CLOMO: Counterfactual Logical Modification with Large Language Models ACL 2024

Benchmarking Chinese Commonsense Reasoning of LLMs: From Chinese-Specifics to Reasoning-Memorization Correlations ACL 2024

VisDiaHalBench: A Visual Dialogue Benchmark For Diagnosing Hallucination in Large Vision-Language Models ACL 2024

Living in the Moment: Can Large Language Models Grasp Co-Temporal Reasoning? ACL 2024

LLMArena: Assessing Capabilities of Large Language Models in Dynamic Multi-Agent Environments ACL 2024

ConceptMath: A Bilingual Concept-wise Benchmark for Measuring Mathematical Reasoning of Large Language Models ACL 2024

MathBench: Evaluating the Theory and Application Proficiency of LLMs with a Hierarchical Mathematics Benchmark ACL 2024

E-EVAL: A Comprehensive Chinese K-12 Education Evaluation Benchmark for Large Language Models ACL 2024

NaturalCodeBench: Examining Coding Performance Mismatch on HumanEval and Natural User Queries ACL 2024

Challenging Large Language Models with New Tasks: A Study on their Adaptability and Robustness ACL 2024

PPTC Benchmark: Evaluating Large Language Models for PowerPoint Task Completion ACL 2024

TempCompass: Do Video LLMs Really Understand Videos? ACL 2024

Data Contamination Calibration for Black-box LLMs ACL 2024

Evaluating Mathematical Reasoning of Large Language Models: A Focus on Error Identification and Correction ACL 2024

Generalization or Memorization: Data Contamination and Trustworthy Evaluation for Large Language Models ACL 2024

MM-LLMs: Recent Advances in MultiModal Large Language Models ACL 2024