Papers
UOUO: Uncontextualized Uncommon Objects for Measuring Knowledge Horizons of Vision Language Models
EMNLP 2024
GTA: A Benchmark for General Tool Agents
NIPS 2024
KoDialogBench: Evaluating Conversational Understanding of Language Models with Korean Dialogue Benchmark
COLING 2024
MoZIP: A Multilingual Benchmark to Evaluate Large Language Models in Intellectual Property
COLING 2024
Question Answering over Tabular Data with DataBench: A Large-Scale Empirical Evaluation of LLMs
COLING 2024
Automating Dataset Updates Towards Reliable and Timely Evaluation of Large Language Models
NIPS 2024
Not All Countries Celebrate Thanksgiving: On the Cultural Dominance in Large Language Models
ACL 2024
Spider2-V: How Far Are Multimodal Agents From Automating Data Science and Engineering Workflows?
NIPS 2024
WikiContradict: A Benchmark for Evaluating LLMs on Real-World Knowledge Conflicts from Wikipedia
NIPS 2024