benchmark evaluation

1539 papers

Also known as

MT-BENCH, BDC

Co-occurring keywords

large language model (12755)
question answering (2904)
multimodal learning (4622)
language model (4573)
multimodal large language model (865)
vision-language model (2235)
visual question answering (1000)
evaluation benchmark (250)
multilingual nlp (1423)
benchmark dataset (619)

Papers

BinMetric: A Comprehensive Binary Code Analysis Benchmark for Large Language Models IJCAI 2025

CogLM: Tracking Cognitive Development of Large Language Models NAACL 2025

EscapeBench: Towards Advancing Creative Intelligence of Language Model Agents ACL 2025

Is your benchmark truly adversarial? AdvScore: Evaluating Human-Grounded Adversarialness NAACL 2025

Theory of Mind in Large Language Models: Assessment and Enhancement ACL 2025

DocBench: A Benchmark for Evaluating LLM-based Document Reading Systems NAACL 2025

RuleArena: A Benchmark for Rule-Guided Reasoning with LLMs in Real-World Scenarios ACL 2025

When2Call: When (not) to Call Tools NAACL 2025

REAL-MM-RAG: A Real-World Multi-Modal Retrieval Benchmark ACL 2025

Text2World: Benchmarking Large Language Models for Symbolic World Model Generation ACL 2025

CULEMO: Cultural Lenses on Emotion - Benchmarking LLMs for Cross-Cultural Emotion Understanding ACL 2025

GitGoodBench: A Novel Benchmark For Evaluating Agentic Performance On Git ACL 2025

GRACE: A Granular Benchmark for Evaluating Model Calibration against Human Calibration ACL 2025

CapArena: Benchmarking and Analyzing Detailed Image Captioning in the LLM Era ACL 2025

ActiView: Evaluating Active Perception Ability for Multimodal Large Language Models ACL 2025

GIMMICK: Globally Inclusive Multimodal Multitask Cultural Knowledge Benchmarking ACL 2025

T2V-CompBench: A Comprehensive Benchmark for Compositional Text-to-video Generation CVPR 2025

WinoWhat: A Parallel Corpus of Paraphrased WinoGrande Sentences with Common Sense Categorization ACL 2025

FEA-Bench: A Benchmark for Evaluating Repository-Level Code Generation for Feature Implementation ACL 2025

Something’s Fishy in the Data Lake: A Critical Re-evaluation of Table Union Search Benchmarks ACL 2025

HATS: Hindi Analogy Test Set for Evaluating Reasoning in Large Language Models ACL 2025

MMMU-Pro: A More Robust Multi-discipline Multimodal Understanding Benchmark ACL 2025

How Far are LLMs from Being Our Digital Twins? A Benchmark for Persona-Based Behavior Chain Simulation ACL 2025

Polishing Every Facet of the GEM: Testing Linguistic Competence of LLMs and Humans in Korean ACL 2025

Are Bias Evaluation Methods Biased? ACL 2025