Research Explorer
Evaluation
1654 directly classified papers
Papers per year
2005: 1
2006: 1
2007: 1
2008: 2
2009: 1
2010: 3
2011: 2
2012: 3
2013: 5
2014: 4
2015: 4
2016: 11
2017: 19
2018: 32
2019: 39
2020: 72
2021: 110
2022: 202
2023: 222
2024: 351
2025: 569
Papers
VCSearch: Bridging the Gap Between Well-Defined and Ill-Defined Problems in Mathematical Reasoning
EMNLP 2025
AgentDiagnose: An Open Toolkit for Diagnosing LLM Agent Trajectories
EMNLP 2025
QG-SMS: Enhancing Test Item Analysis via Student Modeling and Simulation
ACL 2025
PromptSuite: A Task-Agnostic Framework for Multi-Prompt Generation
EMNLP 2025
EditBoard: Towards a Comprehensive Evaluation Benchmark for Text-Based Video Editing Models
AAAI 2025
o-MEGA: Optimized Methods for Explanation Generation and Analysis
EMNLP 2025
ODP-Bench: Benchmarking Out-of-Distribution Performance Prediction
ICCV 2025
TruthTorchLM: A Comprehensive Library for Predicting Truthfulness in LLM Outputs
EMNLP 2025
Membership Inference Attacks with False Discovery Rate Control
ICCV 2025
SAGE: A Generic Framework for LLM Safety Evaluation
EMNLP 2025
InductionBench: LLMs Fail in the Simplest Complexity Class
ACL 2025
Truth, Trust, and Trouble: Medical AI on the Edge
EMNLP 2025
An Empirical Study of Position Bias in Modern Information Retrieval
EMNLP 2025
InstaJudge: Aligning Judgment Bias of LLM-as-Judge with Humans in Industry Applications
EMNLP 2025
Conflicts in Texts: Data, Implications and Challenges
EMNLP 2025
From Feedback to Checklists: Grounded Evaluation of AI-Generated Clinical Notes
EMNLP 2025
HighMATH: Evaluating Math Reasoning of Large Language Models in Breadth and Depth
EMNLP 2025
Time Is Effort: Estimating Human Post-Editing Time for Grammar Error Correction Tool Evaluation
EMNLP 2025
Only a Little to the Left: A Theory-grounded Measure of Political Bias in Large Language Models
ACL 2025
Towards Robust Universal Information Extraction: Dataset, Evaluation, and Solution
ACL 2025
Do Language Models Understand the Cognitive Tasks Given to Them? Investigations with the N-Back Paradigm
ACL 2025
Theory of Mind in Large Language Models: Assessment and Enhancement
ACL 2025
When Models Lie, We Learn: Multilingual Span-Level Hallucination Detection with PsiloQA
EMNLP 2025
SpeechIQ: Speech-Agentic Intelligence Quotient Across Cognitive Levels in Voice Understanding by Large Language Models
ACL 2025
MedFact: A Large-scale Chinese Dataset for Evidence-based Medical Fact-checking of LLM Responses
EMNLP 2025