benchmark evaluation

1539 papers

Also known as

MT-BENCH BDC

Co-occurring keywords

large language model (12755), question answering (2904), multimodal learning (4622), language model (4573), multimodal large language model (865), vision-language model (2235), visual question answering (1000), evaluation benchmark (250), multilingual nlp (1423), benchmark dataset (619)

Papers

Evaluating Multimodal Large Language Models on Video Captioning via Monte Carlo Tree Search ACL 2025

KMMLU: Measuring Massive Multitask Language Understanding in Korean NAACL 2025

PapersPlease: A Benchmark for Evaluating Motivational Values of Large Language Models Based on ERG Theory ACL 2025

Benchmarking Distributional Alignment of Large Language Models NAACL 2025

MatViX: Multimodal Information Extraction from Visually Rich Articles NAACL 2025

Does Data Contamination Detection Work (Well) for LLMs? A Survey and Evaluation on Detection Assumptions NAACL 2025

Find the Intention of Instruction: Comprehensive Evaluation of Instruction Understanding for Large Language Models NAACL 2025

RusCode: Russian Cultural Code Benchmark for Text-to-Image Generation NAACL 2025

LLM-Coordination: Evaluating and Analyzing Multi-agent Coordination Abilities in Large Language Models NAACL 2025

AssertionBench: A Benchmark to Evaluate Large-Language Models for Assertion Generation NAACL 2025

DHP Benchmark: Are LLMs Good NLG Evaluators? NAACL 2025

SimulBench: Evaluating Language Models with Creative Simulation Tasks NAACL 2025

Evaluating and Enhancing Large Language Models for Novelty Assessment in Scholarly Publications NAACL 2025

Are Small Language Models Ready to Compete with Large Language Models for Practical Applications? NAACL 2025

Revitalizing Saturated Benchmarks: A Weighted Metric Approach for Differentiating Large Language Model Performance NAACL 2025

LongSafety: Enhance Safety for Long-Context LLMs ACL 2025

LIFBench: Evaluating the Instruction Following Performance and Stability of Large Language Models in Long-Context Scenarios ACL 2025

Can’t See the Forest for the Trees: Benchmarking Multimodal Safety Awareness for Multimodal LLMs ACL 2025

CliME: Evaluating Multimodal Climate Discourse on Social Media and the Climate Alignment Quotient (CAQ) ACL 2025

MELABenchv1: Benchmarking Large Language Models against Smaller Fine-Tuned Models for Low-Resource Maltese NLP ACL 2025

Can Language Models Replace Programmers for Coding? REPOCOD Says ‘Not Yet’ ACL 2025

FinMME: Benchmark Dataset for Financial Multi-Modal Reasoning Evaluation ACL 2025

FrontierScience Bench: Evaluating AI Research Capabilities in LLMs ACL 2025

SITE: towards Spatial Intelligence Thorough Evaluation ICCV 2025

EGOILLUSION: Benchmarking Hallucinations in Egocentric Video Understanding EMNLP 2025