Papers
Global MMLU: Understanding and Addressing Cultural and Linguistic Biases in Multilingual Evaluation
ACL 2025
Blind Men and the Elephant: Diverse Perspectives on Gender Stereotypes in Benchmark Datasets
EMNLP 2025
WildDoc: How Far Are We from Achieving Comprehensive and Robust Document Understanding in the Wild?
EMNLP 2025
Retrieval Models Aren’t Tool-Savvy: Benchmarking Tool Retrieval for Large Language Models
ACL 2025
Cherry-Picking in Time Series Forecasting: How to Select Datasets to Make Your Model Shine
AAAI 2025
CAPTURE: Evaluating Spatial Reasoning in Vision Language Models via Occluded Object Counting
ICCV 2025
LLM Agents Making Agent Tools
ACL 2025