benchmark evaluation

1539 papers

Also known as

MT-BENCH, BDC

Co-occurring keywords

large language model (12755), question answering (2904), multimodal learning (4622), language model (4573), multimodal large language model (865), vision-language model (2235), visual question answering (1000), evaluation benchmark (250), multilingual nlp (1423), benchmark dataset (619)

Papers

PARAPHRASUS: A Comprehensive Benchmark for Evaluating Paraphrase Detection Models COLING 2025

EXCGEC: A Benchmark for Edit-Wise Explainable Chinese Grammatical Error Correction AAAI 2025

DHP Benchmark: Are LLMs Good NLG Evaluators? NAACL 2025

Interactive Evaluation for Medical LLMs via Task-oriented Dialogue System COLING 2025

Bilingual BSARD: Extending Statutory Article Retrieval to Dutch COLING 2025

Testing the Boundaries of LLMs: Dialectal and Language-Variety Tasks COLING 2025

M2RC-EVAL: Massively Multilingual Repository-level Code Completion Evaluation ACL 2025

Beyond Visual Understanding: Introducing PARROT-360V for Vision Language Model Benchmarking COLING 2025

DICE-BENCH: Evaluating the Tool-Use Capabilities of Large Language Models in Multi-Round, Multi-Party Dialogues ACL 2025

A Benchmark for Hindi Verb-Argument Structure Alternations EMNLP 2025

Findings of the Third BabyLM Challenge: Accelerating Language Modeling Research with Cognitively Plausible Data EMNLP 2025

Evaluating Health Question Answering Under Readability-Controlled Style Perturbations EMNLP 2025

MMRefine: Unveiling the Obstacles to Robust Refinement in Multimodal Large Language Models ACL 2025

Do not Abstain! Identify and Solve the Uncertainty ACL 2025

CBT-Bench: Evaluating Large Language Models on Assisting Cognitive Behavior Therapy NAACL 2025

CVE-Bench: Benchmarking LLM-based Software Engineering Agent’s Ability to Repair Real-World CVE Vulnerabilities NAACL 2025

Improving Model Evaluation using SMART Filtering of Benchmark Datasets NAACL 2025

MLLM-Bench: Evaluating Multimodal LLMs with Per-sample Criteria NAACL 2025

EmoCharacter: Evaluating the Emotional Fidelity of Role-Playing Agents in Dialogues NAACL 2025

GroundCocoa: A Benchmark for Evaluating Compositional & Conditional Reasoning in Language Models NAACL 2025

MADial-Bench: Towards Real-world Evaluation of Memory-Augmented Dialogue Generation NAACL 2025

Measuring and Benchmarking Large Language Models’ Capabilities to Generate Persuasive Language NAACL 2025

MILU: A Multi-task Indic Language Understanding Benchmark NAACL 2025

Lived Experience Not Found: LLMs Struggle to Align with Experts on Addressing Adverse Drug Reactions from Psychiatric Medication Use NAACL 2025

Surge: On the Potential of Large Language Models as General-Purpose Surrogate Code Executors EMNLP 2025