Papers
MisinfoBench: A Multi-Dimensional Benchmark for Evaluating LLMs’ Resilience to Misinformation
EMNLP 2025
Both Text and Images Leaked! A Systematic Analysis of Data Contamination in Multimodal LLM
EMNLP 2025
VisEscape: A Benchmark for Evaluating Exploration-driven Decision-making in Virtual Escape Rooms
EMNLP 2025
FlashAdventure: A Benchmark for GUI Agents Solving Full Story Arcs in Diverse Adventure Games
EMNLP 2025