Papers
PlanningArena: A Modular Benchmark for Multidimensional Evaluation of Planning and Tool Learning
ACL 2025
Towards Dynamic Theory of Mind: Evaluating LLM Adaptation to Temporal Evolution of Human States
ACL 2025
CAPTURE: Evaluating Spatial Reasoning in Vision Language Models via Occluded Object Counting
ICCV 2025
MultiVerse: A Multi-Turn Conversation Benchmark for Evaluating Large Vision and Language Models
ICCV 2025
Cherry-Picking in Time Series Forecasting: How to Select Datasets to Make Your Model Shine
AAAI 2025
MMJ-Bench: A Comprehensive Study on Jailbreak Attacks and Defenses for Vision Language Models
AAAI 2025
Are We Done with MMLU?
NAACL 2025