benchmark evaluation

1539 papers

Also known as

MT-BENCH, BDC

Co-occurring keywords

large language model (12755), question answering (2904), multimodal learning (4622), language model (4573), multimodal large language model (865), vision-language model (2235), visual question answering (1000), evaluation benchmark (250), multilingual nlp (1423), benchmark dataset (619)

Papers

Can LLMs Generate High-Quality Test Cases for Algorithm Problems? TestCase-Eval: A Systematic Evaluation of Fault Coverage and Exposure ACL 2025

MLAlgo-Bench: Can Machines Implement Machine Learning Algorithms? EMNLP 2025

PhysReason: A Comprehensive Benchmark towards Physics-Based Reasoning ACL 2025

Mis-prompt: Benchmarking Large Language Models for Proactive Error Handling ACL 2025

FEA-Bench: A Benchmark for Evaluating Repository-Level Code Generation for Feature Implementation ACL 2025

Agent-RewardBench: Towards a Unified Benchmark for Reward Modeling across Perception, Planning, and Safety in Real-World Multimodal Agents ACL 2025

Insight Over Sight: Exploring the Vision-Knowledge Conflicts in Multimodal LLMs ACL 2025

CULEMO: Cultural Lenses on Emotion - Benchmarking LLMs for Cross-Cultural Emotion Understanding ACL 2025

MMRC: A Large-Scale Benchmark for Understanding Multimodal Large Language Model in Real-World Conversation ACL 2025

TUMLU: A Unified and Native Language Understanding Benchmark for Turkic Languages ACL 2025

“What do you call a dog that is incontrovertibly true? Dogma”: Testing LLM Generalization through Humor ACL 2025

Towards Dynamic Theory of Mind: Evaluating LLM Adaptation to Temporal Evolution of Human States ACL 2025

PRMBench: A Fine-grained and Challenging Benchmark for Process-Level Reward Models ACL 2025

ChatBench: From Static Benchmarks to Human-AI Evaluation ACL 2025

InductionBench: LLMs Fail in the Simplest Complexity Class ACL 2025

EditInspector: A Benchmark for Evaluation of Text-Guided Image Edits ACL 2025

PlanningArena: A Modular Benchmark for Multidimensional Evaluation of Planning and Tool Learning ACL 2025

Batayan: A Filipino NLP benchmark for evaluating Large Language Models ACL 2025

WiCkeD: A Simple Method to Make Multiple Choice Benchmarks More Challenging ACL 2025

SCITAT: A Question Answering Benchmark for Scientific Tables and Text Covering Diverse Reasoning Types ACL 2025

AceMath: Advancing Frontier Math Reasoning with Post-Training and Reward Modeling ACL 2025

MTVQA: Benchmarking Multilingual Text-Centric Visual Question Answering ACL 2025

Awes, Laws, and Flaws From Today’s LLM Research ACL 2025

See the World, Discover Knowledge: A Chinese Factuality Evaluation for Large Vision Language Models ACL 2025

LiteraryQA: Towards Effective Evaluation of Long-document Narrative QA EMNLP 2025