Papers
When2Call: When (not) to Call Tools
NAACL 2025
CULEMO: Cultural Lenses on Emotion - Benchmarking LLMs for Cross-Cultural Emotion Understanding
ACL 2025
WinoWhat: A Parallel Corpus of Paraphrased WinoGrande Sentences with Common Sense Categorization
ACL 2025
FEA-Bench: A Benchmark for Evaluating Repository-Level Code Generation for Feature Implementation
ACL 2025
Something’s Fishy in the Data Lake: A Critical Re-evaluation of Table Union Search Benchmarks
ACL 2025
Polishing Every Facet of the GEM: Testing Linguistic Competence of LLMs and Humans in Korean
ACL 2025
Are Bias Evaluation Methods Biased?
ACL 2025