Papers
DHP Benchmark: Are LLMs Good NLG Evaluators?
NAACL 2025
Beyond Visual Understanding: Introducing PARROT-360V for Vision Language Model Benchmarking
COLING 2025
GroundCocoa: A Benchmark for Evaluating Compositional & Conditional Reasoning in Language Models
NAACL 2025