benchmark evaluation

1539 papers

Also known as

MT-BENCH BDC

Co-occurring keywords

large language model (12755), question answering (2904), multimodal learning (4622), language model (4573), multimodal large language model (865), vision-language model (2235), visual question answering (1000), evaluation benchmark (250), multilingual nlp (1423), benchmark dataset (619)

Papers

SETLEXSEM CHALLENGE: Using Set Operations to Evaluate the Lexical and Semantic Robustness of Language Models NIPS 2024

Developing a Pragmatic Benchmark for Assessing Korean Legal Language Understanding in Large Language Models EMNLP 2024

CLEAR: Can Language Models Really Understand Causal Graphs? EMNLP 2024

Difficult Task Yes but Simple Task No: Unveiling the Laziness in Multimodal LLMs EMNLP 2024

Towards Benchmarking Situational Awareness of Large Language Models: Comprehensive Benchmark, Evaluation and Analysis EMNLP 2024

LogicBench: Towards Systematic Evaluation of Logical Reasoning Ability of Large Language Models ACL 2024

∞Bench: Extending Long Context Evaluation Beyond 100K Tokens ACL 2024

M4LE: A Multi-Ability Multi-Range Multi-Task Multi-Domain Long-Context Evaluation Benchmark for Large Language Models ACL 2024

ToMBench: Benchmarking Theory of Mind in Large Language Models ACL 2024

LooGLE: Can Long-Context Language Models Understand Long Contexts? ACL 2024

ImplicitAVE: An Open-Source Dataset and Multimodal LLMs Benchmark for Implicit Attribute Value Extraction ACL 2024

Collaboration or Corporate Capture? Quantifying NLP’s Reliance on Industry Artifacts and Contributions ACL 2024

GeoEval: Benchmark for Evaluating LLMs and Multi-Modal Models on Geometry Problem-Solving ACL 2024

Uncovering Limitations of Large Language Models in Information Seeking from Tables ACL 2024

CHARP: Conversation History AwaReness Probing for Knowledge-grounded Dialogue Systems ACL 2024

CriticBench: Benchmarking LLMs for Critique-Correct Reasoning ACL 2024

SocialBench: Sociality Evaluation of Role-Playing Conversational Agents ACL 2024

Social Intelligence Data Infrastructure: Structuring the Present and Navigating the Future ACL 2024

Planning, Creation, Usage: Benchmarking LLMs for Comprehensive Tool Utilization in Real-World Complex Scenarios ACL 2024

Evaluating the Elementary Multilingual Capabilities of Large Language Models with MultiQ ACL 2024

StructEval: Deepen and Broaden Large Language Model Assessment via Structured Evaluation ACL 2024

Evaluating Large Language Models on Wikipedia-Style Survey Generation ACL 2024

All Languages Matter: On the Multilingual Safety of LLMs ACL 2024

CToolEval: A Chinese Benchmark for LLM-Powered Agent Evaluation in Real-World API Interactions ACL 2024

Negative Object Presence Evaluation (NOPE) to Measure Object Hallucination in Vision-Language Models ACL 2024