benchmark evaluation

1539 papers

Also known as

MT-BENCH BDC

Co-occurring keywords

large language model (12755), question answering (2904), multimodal learning (4622), language model (4573), multimodal large language model (865), vision-language model (2235), visual question answering (1000), evaluation benchmark (250), multilingual nlp (1423), benchmark dataset (619)

Papers

FlashAdventure: A Benchmark for GUI Agents Solving Full Story Arcs in Diverse Adventure Games EMNLP 2025

Beyond the Haystack: Sensitivity to Context in Legal Reference Recall EMNLP 2025

Reefknot: A Comprehensive Benchmark for Relation Hallucination Evaluation, Analysis and Mitigation in Multimodal Large Language Models ACL 2025

Firm or Fickle? Evaluating Large Language Models Consistency in Sequential Interactions ACL 2025

LiteraryQA: Towards Effective Evaluation of Long-document Narrative QA EMNLP 2025

R2I-Bench: Benchmarking Reasoning-Driven Text-to-Image Generation EMNLP 2025

Can LLMs Help You at Work? A Sandbox for Evaluating LLM Agents in Enterprise Environments EMNLP 2025

OmniEval: An Omnidirectional and Automatic RAG Evaluation Benchmark in Financial Domain EMNLP 2025

QualBench: Benchmarking Chinese LLMs with Localized Professional Qualifications for Vertical Domain Evaluation EMNLP 2025

Current Semantic-change Quantification Methods Struggle with Discovery in the Wild EMNLP 2025

LMR-BENCH: Evaluating LLM Agent’s Ability on Reproducing Language Modeling Research EMNLP 2025

UBench: Benchmarking Uncertainty in Large Language Models with Multiple Choice Questions ACL 2025

Audio-centric Video Understanding Benchmark without Text Shortcut EMNLP 2025

ARC ‘Challenge’ Is Not That Challenging ACL 2025

PoSum-Bench: Benchmarking Position Bias in LLM-based Conversational Summarization EMNLP 2025

Text2Vis: A Challenging and Diverse Benchmark for Generating Multimodal Visualizations from Text EMNLP 2025

TurnBack: A Geospatial Route Cognition Benchmark for Large Language Models through Reverse Route EMNLP 2025

BioHopR: A Benchmark for Multi-Hop, Multi-Answer Reasoning in Biomedical Domain ACL 2025

Can Large Language Models Win the International Mathematical Games? EMNLP 2025

WebUIBench: A Comprehensive Benchmark for Evaluating Multimodal Large Language Models in WebUI-to-Code ACL 2025

Deriving Strategic Market Insights with Large Language Models: A Benchmark for Forward Counterfactual Generation EMNLP 2025

MMInA: Benchmarking Multihop Multimodal Internet Agents ACL 2025

Examining False Positives under Inference Scaling for Mathematical Reasoning EMNLP 2025

PersonaLens: A Benchmark for Personalization Evaluation in Conversational AI Assistants ACL 2025

Benchmarking LLMs on Semantic Overlap Summarization EMNLP 2025