Research Explorer
benchmark evaluation
1539 papers
Also known as
MT-BENCH
BDC
Co-occurring keywords
large language model (12755)
question answering (2904)
multimodal learning (4622)
language model (4573)
multimodal large language model (865)
vision-language model (2235)
visual question answering (1000)
evaluation benchmark (250)
multilingual nlp (1423)
benchmark dataset (619)
Papers
SETLEXSEM CHALLENGE: Using Set Operations to Evaluate the Lexical and Semantic Robustness of Language Models
NIPS 2024
Developing a Pragmatic Benchmark for Assessing Korean Legal Language Understanding in Large Language Models
EMNLP 2024
CLEAR: Can Language Models Really Understand Causal Graphs?
EMNLP 2024
Difficult Task Yes but Simple Task No: Unveiling the Laziness in Multimodal LLMs
EMNLP 2024
Towards Benchmarking Situational Awareness of Large Language Models: Comprehensive Benchmark, Evaluation and Analysis
EMNLP 2024
LogicBench: Towards Systematic Evaluation of Logical Reasoning Ability of Large Language Models
ACL 2024
∞Bench: Extending Long Context Evaluation Beyond 100K Tokens
ACL 2024
M4LE: A Multi-Ability Multi-Range Multi-Task Multi-Domain Long-Context Evaluation Benchmark for Large Language Models
ACL 2024
ToMBench: Benchmarking Theory of Mind in Large Language Models
ACL 2024
LooGLE: Can Long-Context Language Models Understand Long Contexts?
ACL 2024
ImplicitAVE: An Open-Source Dataset and Multimodal LLMs Benchmark for Implicit Attribute Value Extraction
ACL 2024
Collaboration or Corporate Capture? Quantifying NLP’s Reliance on Industry Artifacts and Contributions
ACL 2024
GeoEval: Benchmark for Evaluating LLMs and Multi-Modal Models on Geometry Problem-Solving
ACL 2024
Uncovering Limitations of Large Language Models in Information Seeking from Tables
ACL 2024
CHARP: Conversation History AwaReness Probing for Knowledge-grounded Dialogue Systems
ACL 2024
CriticBench: Benchmarking LLMs for Critique-Correct Reasoning
ACL 2024
SocialBench: Sociality Evaluation of Role-Playing Conversational Agents
ACL 2024
Social Intelligence Data Infrastructure: Structuring the Present and Navigating the Future
ACL 2024
Planning, Creation, Usage: Benchmarking LLMs for Comprehensive Tool Utilization in Real-World Complex Scenarios
ACL 2024
Evaluating the Elementary Multilingual Capabilities of Large Language Models with MultiQ
ACL 2024
StructEval: Deepen and Broaden Large Language Model Assessment via Structured Evaluation
ACL 2024
Evaluating Large Language Models on Wikipedia-Style Survey Generation
ACL 2024
All Languages Matter: On the Multilingual Safety of LLMs
ACL 2024
CToolEval: A Chinese Benchmark for LLM-Powered Agent Evaluation in Real-World API Interactions
ACL 2024
Negative Object Presence Evaluation (NOPE) to Measure Object Hallucination in Vision-Language Models
ACL 2024