Papers
M3Exam: A Multilingual, Multimodal, Multilevel Benchmark for Examining Large Language Models
NeurIPS 2023
PIXIU: A Comprehensive Benchmark, Instruction Dataset and Large Language Model for Finance
NeurIPS 2023
Benchmarking Large Language Models on CMExam - A comprehensive Chinese Medical Exam Dataset
NeurIPS 2023
Revisiting Out-of-distribution Robustness in NLP: Benchmarks, Analysis, and LLMs Evaluations
NeurIPS 2023
Mathematical Capabilities of ChatGPT
NeurIPS 2023
RobuT: A Systematic Study of Table QA Robustness Against Human-Annotated Adversarial Perturbations
ACL 2023