Research Explorer
Evaluation
1654 directly classified papers
Papers per year
2005: 1
2006: 1
2007: 1
2008: 2
2009: 1
2010: 3
2011: 2
2012: 3
2013: 5
2014: 4
2015: 4
2016: 11
2017: 19
2018: 32
2019: 39
2020: 72
2021: 110
2022: 202
2023: 222
2024: 351
2025: 569
Papers
Noise-Aware Evaluation of Object Detectors
WACV 2025
Label Convergence: Defining an Upper Performance Bound in Object Recognition through Contradictory Annotations
WACV 2025
T2V-CompBench: A Comprehensive Benchmark for Compositional Text-to-video Generation
CVPR 2025
Image Generation Diversity Issues and How to Tame Them
CVPR 2025
VideoAutoArena: An Automated Arena for Evaluating Large Multimodal Models in Video Analysis through User Simulation
CVPR 2025
K-Sort Arena: Efficient and Reliable Benchmarking for Generative Models via K-wise Human Preferences
CVPR 2025
SVLTA: Benchmarking Vision-Language Temporal Alignment via Synthetic Video Situation
CVPR 2025
Image Over Text: Transforming Formula Recognition Evaluation with Character Detection Matching
CVPR 2025
ETF: An Entity Tracing Framework for Hallucination Detection in Code Summaries
ACL 2025
LOFT: Scalable and More Realistic Long-Context Evaluation
NAACL 2025
Aligning Black-box Language Models with Human Judgments
NAACL 2025
CCHall: A Novel Benchmark for Joint Cross-Lingual and Cross-Modal Hallucinations Detection in Large Language Models
ACL 2025
What’s the most important value? INVP: INvestigating the Value Priorities of LLMs through Decision-making in Social Scenarios
COLING 2025
Image Captioning Evaluation in the Age of Multimodal LLMs: Challenges and Future Perspectives
IJCAI 2025
CaLQuest.PT: Towards the Collection and Evaluation of Natural Causal Ladder Questions in Portuguese for AI Agents
COLING 2025
Benchmarking LLMs and LLM-based Agents in Practical Vulnerability Detection for Code Repositories
ACL 2025
Are Language Models Agnostic to Linguistically Grounded Perturbations? A Case Study of Indic Languages
NAACL 2025
Evaluating Evaluation Metrics for Ancient Chinese to English Machine Translation
NAACL 2025
Towards Region-aware Bias Evaluation Metrics
NAACL 2025
SeedBench: A Multi-task Benchmark for Evaluating Large Language Models in Seed Science
ACL 2025
Why should only High-Resource-Languages have all the fun? Pivot Based Evaluation in Low Resource Setting
COLING 2025
BasqBBQ: A QA Benchmark for Assessing Social Biases in LLMs for Basque, a Low-Resource Language
COLING 2025
Does Training on Synthetic Data Make Models Less Robust?
NAACL 2025
WiCkeD: A Simple Method to Make Multiple Choice Benchmarks More Challenging
ACL 2025
Only a Little to the Left: A Theory-grounded Measure of Political Bias in Large Language Models
ACL 2025