benchmark evaluation

1539 papers

Also known as

MT-BENCH, BDC

Co-occurring keywords

large language model (12755), question answering (2904), multimodal learning (4622), language model (4573), multimodal large language model (865), vision-language model (2235), visual question answering (1000), evaluation benchmark (250), multilingual nlp (1423), benchmark dataset (619)

Papers

What can Large Language Models Capture about Code Functional Equivalence? NAACL 2025

ComprehendEdit: A Comprehensive Dataset and Evaluation Framework for Multimodal Knowledge Editing AAAI 2025

Expanding the WMT24++ Benchmark with Rumantsch Grischun, Sursilvan, Sutsilvan, Surmiran, Puter, and Vallader EMNLP 2025

Keep Guessing? When Considering Inference Scaling, Mind the Baselines NAACL 2025

GuessArena: Guess Who I Am? A Self-Adaptive Framework for Evaluating LLMs in Domain-Specific Knowledge and Reasoning ACL 2025

Findings of the WMT25 Multilingual Instruction Shared Task: Persistent Hurdles in Reasoning, Generation, and Evaluation EMNLP 2025

Consensus or Conflict? Fine-Grained Evaluation of Conflicting Answers in Question-Answering EMNLP 2025

UCFE: A User-Centric Financial Expertise Benchmark for Large Language Models NAACL 2025

HALLUCINOGEN: Benchmarking Hallucination in Implicit Reasoning within Large Vision Language Models EMNLP 2025

SVLTA: Benchmarking Vision-Language Temporal Alignment via Synthetic Video Situation CVPR 2025

MMAU: A Holistic Benchmark of Agent Capabilities Across Diverse Domains NAACL 2025

MCS-Bench: A Comprehensive Benchmark for Evaluating Multimodal Large Language Models in Chinese Classical Studies ACL 2025

LLM-Coordination: Evaluating and Analyzing Multi-agent Coordination Abilities in Large Language Models NAACL 2025

All Languages Matter: Evaluating LMMs on Culturally Diverse 100 Languages CVPR 2025

CheckManual: A New Challenge and Benchmark for Manual-based Appliance Manipulation CVPR 2025

Swan and ArabicMTEB: Dialect-Aware, Arabic-Centric, Cross-Lingual, and Cross-Cultural Embedding Models and Benchmarks NAACL 2025

OVO-Bench: How Far is Your Video-LLMs from Real-World Online Video Understanding? CVPR 2025

OpenMIBOOD: Open Medical Imaging Benchmarks for Out-Of-Distribution Detection CVPR 2025

VLind-Bench: Measuring Language Priors in Large Vision-Language Models NAACL 2025

TripleFact: Defending Data Contamination in the Evaluation of LLM-driven Fake News Detection ACL 2025

NTSEBENCH: Cognitive Reasoning Benchmark for Vision Language Models NAACL 2025

Q-Bench-Video: Benchmark the Video Quality Understanding of LMMs CVPR 2025

CodeRAG-Bench: Can Retrieval Augment Code Generation? NAACL 2025

Unveiling the Ignorance of MLLMs: Seeing Clearly, Answering Incorrectly CVPR 2025

Mind the Gap: Static and Interactive Evaluations of Large Audio Models ACL 2025