Research Explorer
Evaluation
1654 directly classified papers
Papers per year
2005: 1
2006: 1
2007: 1
2008: 2
2009: 1
2010: 3
2011: 2
2012: 3
2013: 5
2014: 4
2015: 4
2016: 11
2017: 19
2018: 32
2019: 39
2020: 72
2021: 110
2022: 202
2023: 222
2024: 351
2025: 569
Papers
ConLoan: A Contrastive Multilingual Dataset for Evaluating Loanwords
ACL 2025
Trick or Neat: Adversarial Ambiguity and Language Model Evaluation
ACL 2025
MDBench: A Synthetic Multi-Document Reasoning Benchmark Generated with Knowledge Guidance
ACL 2025
Lost in Transcription, Found in Distribution Shift: Demystifying Hallucination in Speech Foundation Models
ACL 2025
7 Points to Tsinghua but 10 Points to ? Assessing Large Language Models in Agentic Multilingual National Bias
ACL 2025
Bridging the Gap Between Ideal and Real-world Evaluation: Benchmarking AI-Generated Image Detection in Challenging Scenarios
ICCV 2025
T2V-CompBench: A Comprehensive Benchmark for Compositional Text-to-video Generation
CVPR 2025
Rethinking Prompt-based Debiasing in Large Language Model
ACL 2025
Image Generation Diversity Issues and How to Tame Them
CVPR 2025
SubLIME: Subset Selection via Rank Correlation Prediction for Data-Efficient LLM Evaluation
ACL 2025
VideoAutoArena: An Automated Arena for Evaluating Large Multimodal Models in Video Analysis through User Simulation
CVPR 2025
Analyzing Interview Questions via Bloom’s Taxonomy to Enhance the Design Thinking Process
ACL 2025
K-Sort Arena: Efficient and Reliable Benchmarking for Generative Models via K-wise Human Preferences
CVPR 2025
Judging It, Washing It: Scoring and Greenwashing Corporate Climate Disclosures using Large Language Models
ACL 2025
SVLTA: Benchmarking Vision-Language Temporal Alignment via Synthetic Video Situation
CVPR 2025
WinoWhat: A Parallel Corpus of Paraphrased WinoGrande Sentences with Common Sense Categorization
ACL 2025
Toward Reliable Scientific Hypothesis Generation: Evaluating Truthfulness and Hallucination in Large Language Models
IJCAI 2025
Contamination Budget: Trade-offs Between Breadth, Depth and Difficulty
IJCAI 2025
Game Theory Meets Large Language Models: A Systematic Survey
IJCAI 2025
Large Language Models Fall Short: Understanding Complex Relationships in Detective Narratives
ACL 2024
“My Answer is C”: First-Token Probabilities Do Not Match Text Answers in Instruction-Tuned Language Models
ACL 2024
The State of Relation Extraction Data Quality: Is Bigger Always Better?
ACL 2024
Introducing GenCeption for Multimodal LLM Benchmarking: You May Bypass Annotations
NAACL 2024
Holistic Evaluation of Large Language Models: Assessing Robustness, Accuracy, and Toxicity for Real-World Applications
NAACL 2024
HANS, are you clever? Clever Hans Effect Analysis of Neural Systems
NAACL 2024