Papers
Hi-ToM: A Benchmark for Evaluating Higher-Order Theory of Mind Reasoning in Large Language Models
EMNLP 2023
Seeing is not always believing: Benchmarking Human and Model Perception of AI-Generated Images
NeurIPS 2023
VisIT-Bench: A Dynamic Benchmark for Evaluating Instruction-Following Vision-and-Language Models
NeurIPS 2023
LegalBench: A Collaboratively Built Benchmark for Measuring Legal Reasoning in Large Language Models
NeurIPS 2023