benchmark evaluation

1539 papers

Also known as

MT-BENCH, BDC

Co-occurring keywords

large language model (12755), question answering (2904), multimodal learning (4622), language model (4573), multimodal large language model (865), vision-language model (2235), visual question answering (1000), evaluation benchmark (250), multilingual nlp (1423), benchmark dataset (619)

Papers

Do Localization Methods Actually Localize Memorized Data in LLMs? A Tale of Two Benchmarks NAACL 2024

SeaEval for Multilingual Foundation Models: From Cross-Lingual Alignment to Cultural Reasoning NAACL 2024

Eyes Wide Shut? Exploring the Visual Shortcomings of Multimodal LLMs CVPR 2024

CARE: Extracting Experimental Findings From Clinical Literature NAACL 2024

EgoThink: Evaluating First-Person Perspective Thinking Capability of Vision-Language Models CVPR 2024

MedSafetyBench: Evaluating and Improving the Medical Safety of Large Language Models NIPS 2024

TimeChara: Evaluating Point-in-Time Character Hallucination of Role-Playing Large Language Models ACL 2024

Large Language Models Are Poor Clinical Decision-Makers: A Comprehensive Benchmark EMNLP 2024

TRoTR: A Framework for Evaluating the Re-contextualization of Text Reuse EMNLP 2024

Still Not Quite There! Evaluating Large Language Models for Comorbid Mental Health Diagnosis EMNLP 2024

LLM-Evolve: Evaluation for LLM’s Evolving Capability on Benchmarks EMNLP 2024

Data Contamination Can Cross Language Barriers EMNLP 2024

Perceptions to Beliefs: Exploring Precursory Inferences for Theory of Mind in Large Language Models EMNLP 2024

MT-Eval: A Multi-Turn Capabilities Evaluation Benchmark for Large Language Models EMNLP 2024

Multi-LogiEval: Towards Evaluating Multi-Step Logical Reasoning Ability of Large Language Models EMNLP 2024

Connecting the Dots: Evaluating Abstract Reasoning Capabilities of LLMs Using the New York Times Connections Word Game EMNLP 2024

Evaluating Large Language Models on Time Series Feature Understanding: A Comprehensive Taxonomy and Benchmark EMNLP 2024

The Greatest Good Benchmark: Measuring LLMs’ Alignment with Utilitarian Moral Dilemmas EMNLP 2024

Needle In A Multimodal Haystack NIPS 2024

Multi-Task Inference: Can Large Language Models Follow Multiple Instructions at Once? ACL 2024

Benchmarking Data Science Agents ACL 2024

Me, Myself, and AI: The Situational Awareness Dataset (SAD) for LLMs NIPS 2024

Shopping MMLU: A Massive Multi-Task Online Shopping Benchmark for Large Language Models NIPS 2024

Do Text-to-Vis Benchmarks Test Real Use of Visualisations? EMNLP 2024

Liar, Liar, Logical Mire: A Benchmark for Suppositional Reasoning in Large Language Models EMNLP 2024