benchmark evaluation

1539 papers

Also known as

MT-BENCH, BDC

Co-occurring keywords

large language model (12755), question answering (2904), multimodal learning (4622), language model (4573), multimodal large language model (865), vision-language model (2235), visual question answering (1000), evaluation benchmark (250), multilingual nlp (1423), benchmark dataset (619)

Papers

InstructPart: Task-Oriented Part Segmentation with Instruction Reasoning ACL 2025

Are LLMs Better than Reported? Detecting Label Errors and Mitigating Their Effect on Model Performance EMNLP 2025

Morables: A Benchmark for Assessing Abstract Moral Reasoning in LLMs with Fables EMNLP 2025

FCMR: Robust Evaluation of Financial Cross-Modal Multi-Hop Reasoning ACL 2025

CoIR: A Comprehensive Benchmark for Code Information Retrieval Models ACL 2025

VisFinEval: A Scenario-Driven Chinese Multimodal Benchmark for Holistic Financial Understanding EMNLP 2025

RusConText Benchmark: A Russian Language Evaluation Benchmark for Understanding Context ACL 2025

Limited Generalizability in Argument Mining: State-Of-The-Art Models Learn Datasets, Not Arguments ACL 2025

Can Large Language Models Be Good Language Teachers? EMNLP 2025

TactfulToM: Do LLMs have the Theory of Mind ability to understand White Lies? EMNLP 2025

EGOILLUSION: Benchmarking Hallucinations in Egocentric Video Understanding EMNLP 2025

The Essence of Contextual Understanding in Theory of Mind: A Study on Question Answering with Story Characters ACL 2025

Global MMLU: Understanding and Addressing Cultural and Linguistic Biases in Multilingual Evaluation ACL 2025

WildDoc: How Far Are We from Achieving Comprehensive and Robust Document Understanding in the Wild? EMNLP 2025

Value Portrait: Assessing Language Models’ Values through Psychometrically and Ecologically Valid Items ACL 2025

MemeQA: Holistic Evaluation for Meme Understanding ACL 2025

Blind Men and the Elephant: Diverse Perspectives on Gender Stereotypes in Benchmark Datasets EMNLP 2025

DCR: Quantifying Data Contamination in LLMs Evaluation EMNLP 2025

Something’s Fishy in the Data Lake: A Critical Re-evaluation of Table Union Search Benchmarks ACL 2025

WinoWhat: A Parallel Corpus of Paraphrased WinoGrande Sentences with Common Sense Categorization ACL 2025

LLMs Are Not Intelligent Thinkers: Introducing Mathematical Topic Tree Benchmark for Comprehensive Evaluation of LLMs NAACL 2025

GitGoodBench: A Novel Benchmark For Evaluating Agentic Performance On Git ACL 2025

Text2World: Benchmarking Large Language Models for Symbolic World Model Generation ACL 2025

DSBC : Data Science task Benchmarking with Context engineering AACL 2025

When2Call: When (not) to Call Tools NAACL 2025