benchmark evaluation

1539 papers

Also known as

MT-BENCH, BDC

Co-occurring keywords

large language model (12755), question answering (2904), multimodal learning (4622), language model (4573), multimodal large language model (865), vision-language model (2235), visual question answering (1000), evaluation benchmark (250), multilingual nlp (1423), benchmark dataset (619)

Papers

Benchmarking Uncertainty Disentanglement: Specialized Uncertainties for Specialized Tasks NIPS 2024

Multilingual Fact-Checking using LLMs EMNLP 2024

An Empirical Analysis on Spatial Reasoning Capabilities of Large Multimodal Models EMNLP 2024

Plot Twist: Multimodal Models Don’t Comprehend Simple Chart Details EMNLP 2024

ProbTS: Benchmarking Point and Distributional Forecasting across Diverse Prediction Horizons NIPS 2024

JailbreakBench: An Open Robustness Benchmark for Jailbreaking Large Language Models NIPS 2024

ODRL: A Benchmark for Off-Dynamics Reinforcement Learning NIPS 2024

Is A Picture Worth A Thousand Words? Delving Into Spatial Reasoning for Vision Language Models NIPS 2024

Weak-eval-Strong: Evaluating and Eliciting Lateral Thinking of LLMs with Situation Puzzles NIPS 2024

MedCalc-Bench: Evaluating Large Language Models for Medical Calculations NIPS 2024

Who Evaluates the Evaluations? Objectively Scoring Text-to-Image Prompt Coherence Metrics with T2IScoreScore (TS2) NIPS 2024

MuChin: A Chinese Colloquial Description Benchmark for Evaluating Language Models in the Field of Music IJCAI 2024

Revisiting, Benchmarking and Understanding Unsupervised Graph Domain Adaptation NIPS 2024

VGBench: Evaluating Large Language Models on Vector Graphics Understanding and Generation EMNLP 2024

A User-Centric Multi-Intent Benchmark for Evaluating Large Language Models EMNLP 2024

GeSS: Benchmarking Geometric Deep Learning under Scientific Applications with Distribution Shifts NIPS 2024

CUTE: Measuring LLMs’ Understanding of Their Tokens EMNLP 2024

African or European Swallow? Benchmarking Large Vision-Language Models for Fine-Grained Object Classification EMNLP 2024

UHGEval: Benchmarking the Hallucination of Chinese Large Language Models via Unconstrained Generation ACL 2024

RoTBench: A Multi-Level Benchmark for Evaluating the Robustness of Large Language Models in Tool Learning EMNLP 2024

AMLB: an AutoML Benchmark JMLR 2024

CoIN: A Benchmark of Continual Instruction Tuning for Multimodel Large Language Models NIPS 2024

NYU CTF Bench: A Scalable Open-Source Benchmark Dataset for Evaluating LLMs in Offensive Security NIPS 2024

GMAI-MMBench: A Comprehensive Multimodal Evaluation Benchmark Towards General Medical AI NIPS 2024

Mementos: A Comprehensive Benchmark for Multimodal Large Language Model Reasoning over Image Sequences ACL 2024