Papers
FEA-Bench: A Benchmark for Evaluating Repository-Level Code Generation for Feature Implementation
ACL 2025
CULEMO: Cultural Lenses on Emotion - Benchmarking LLMs for Cross-Cultural Emotion Understanding
ACL 2025
Towards Dynamic Theory of Mind: Evaluating LLM Adaptation to Temporal Evolution of Human States
ACL 2025
PlanningArena: A Modular Benchmark for Multidimensional Evaluation of Planning and Tool Learning
ACL 2025
See the World, Discover Knowledge: A Chinese Factuality Evaluation for Large Vision Language Models
ACL 2025
Are Bias Evaluation Methods Biased?
ACL 2025