Papers
AppWorld: A Controllable World of Apps and People for Benchmarking Interactive Coding Agents
ACL 2024
WorkArena++: Towards Compositional Planning and Reasoning-based Common Knowledge Work Tasks
NeurIPS 2024
GTBench: Uncovering the Strategic Reasoning Capabilities of LLMs via Game-Theoretic Evaluations
NeurIPS 2024
AutoDetect: Towards a Unified Framework for Automated Weakness Detection in Large Language Models
EMNLP 2024