Research Explorer
Evaluation
1654 directly classified papers
Papers per year
2005: 1
2006: 1
2007: 1
2008: 2
2009: 1
2010: 3
2011: 2
2012: 3
2013: 5
2014: 4
2015: 4
2016: 11
2017: 19
2018: 32
2019: 39
2020: 72
2021: 110
2022: 202
2023: 222
2024: 351
2025: 569
Papers
Generative Interpretation: Toward Human-Like Evaluation for Educational Question-Answer Pair Generation
EACL 2024
Challenging Large Language Models with New Tasks: A Study on their Adaptability and Robustness
ACL 2024
Conditional and Modal Reasoning in Large Language Models
EMNLP 2024
MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark
NeurIPS 2024
SeeGULL Multilingual: a Dataset of Geo-Culturally Situated Stereotypes
ACL 2024
What Kind of Sourcery is This? Evaluating GPT-4’s Performance on Linking Scientific Fact to Citations
EMNLP 2024
ArabicMMLU: Assessing Massive Multitask Language Understanding in Arabic
ACL 2024
CritiqueLLM: Towards an Informative Critique Generation Model for Evaluation of Large Language Model Generation
ACL 2024
Fine-Grained Detection of Solidarity for Women and Migrants in 155 Years of German Parliamentary Debates
EMNLP 2024
Quantifying Contamination in Evaluating Code Generation Capabilities of Language Models
ACL 2024
Understanding the Effects of Noise in Text-to-SQL: An Examination of the BIRD-Bench Benchmark
ACL 2024
Intuitive or Dependent? Investigating LLMs’ Behavior Style to Conflicting Prompts
ACL 2024
UOUO: Uncontextualized Uncommon Objects for Measuring Knowledge Horizons of Vision Language Models
EMNLP 2024
Language Models Do Hard Arithmetic Tasks Easily and Hardly Do Easy Arithmetic Tasks
ACL 2024
Understanding and Mitigating Language Confusion in LLMs
EMNLP 2024
FOFO: A Benchmark to Evaluate LLMs’ Format-Following Capability
ACL 2024
BotEval: Facilitating Interactive Human Evaluation
ACL 2024
UltraEval: A Lightweight Platform for Flexible and Comprehensive Evaluation for LLMs
ACL 2024
LJPCheck: Functional Tests for Legal Judgment Prediction
ACL 2024
MVP-Bench: Can Large Vision-Language Models Conduct Multi-level Visual Perception Like Humans?
EMNLP 2024
PARIKSHA: A Large-Scale Investigation of Human-LLM Evaluator Agreement on Multilingual and Multi-Cultural Data
EMNLP 2024
EASSE-DE & EASSE-multi: Easier Automatic Sentence Simplification Evaluation for German & Multiple Languages
EMNLP 2024
Analyzing LLM Behavior in Dialogue Summarization: Unveiling Circumstantial Hallucination Trends
ACL 2024
NaturalCodeBench: Examining Coding Performance Mismatch on HumanEval and Natural User Queries
ACL 2024
How Hard is this Test Set? NLI Characterization by Exploiting Training Dynamics
EMNLP 2024