Research Explorer
Evaluation
1654 directly classified papers
Papers per year
2005: 1
2006: 1
2007: 1
2008: 2
2009: 1
2010: 3
2011: 2
2012: 3
2013: 5
2014: 4
2015: 4
2016: 11
2017: 19
2018: 32
2019: 39
2020: 72
2021: 110
2022: 202
2023: 222
2024: 351
2025: 569
Papers
“A good pun is its own reword”: Can Large Language Models Understand Puns?
EMNLP 2024
ToolBeHonest: A Multi-level Hallucination Diagnostic Benchmark for Tool-Augmented Large Language Models
EMNLP 2024
Humans or LLMs as the Judge? A Study on Judgement Bias
EMNLP 2024
Efficient Performance Tracking: Leveraging Large Language Models for Automated Construction of Scientific Leaderboards
EMNLP 2024
LawBench: Benchmarking Legal Knowledge of Large Language Models
EMNLP 2024
F-Eval: Assessing Fundamental Abilities with Refined Evaluation Methods
ACL 2024
PARIKSHA: A Large-Scale Investigation of Human-LLM Evaluator Agreement on Multilingual and Multi-Cultural Data
EMNLP 2024
Can Large Language Models Faithfully Express Their Intrinsic Uncertainty in Words?
EMNLP 2024
Large Language Models are not Fair Evaluators
ACL 2024
Understanding and Mitigating Language Confusion in LLMs
EMNLP 2024
UOUO: Uncontextualized Uncommon Objects for Measuring Knowledge Horizons of Vision Language Models
EMNLP 2024
T-Eval: Evaluating the Tool Utilization Capability of Large Language Models Step by Step
ACL 2024
Fine-Grained Detection of Solidarity for Women and Migrants in 155 Years of German Parliamentary Debates
EMNLP 2024
ToxiCloakCN: Evaluating Robustness of Offensive Language Detection in Chinese with Cloaking Perturbations
EMNLP 2024
Pretraining Data Detection for Large Language Models: A Divergence-based Calibration Method
EMNLP 2024
LUQ: Long-text Uncertainty Quantification for LLMs
EMNLP 2024
Why Don’t Prompt-Based Fairness Metrics Correlate?
ACL 2024
MM-SAP: A Comprehensive Benchmark for Assessing Self-Awareness of Multimodal Large Language Models in Perception
ACL 2024
Beyond Reference: Evaluating High Quality Translations Better than Human References
EMNLP 2024
FOOL ME IF YOU CAN! An Adversarial Dataset to Investigate the Robustness of LMs in Word Sense Disambiguation
EMNLP 2024
An Analysis of Multilingual FActScore
EMNLP 2024
Prometheus 2: An Open Source Language Model Specialized in Evaluating Other Language Models
EMNLP 2024
When Reasoning Meets Information Aggregation: A Case Study with Sports Narratives
EMNLP 2024
Different Tastes of Entities: Investigating Human Label Variation in Named Entity Annotations
EACL 2024
Order Effects in Annotation Tasks: Further Evidence of Annotation Sensitivity
EACL 2024