Evaluation

1654 directly classified papers

Papers

ConLoan: A Contrastive Multilingual Dataset for Evaluating Loanwords ACL 2025

Trick or Neat: Adversarial Ambiguity and Language Model Evaluation ACL 2025

MDBench: A Synthetic Multi-Document Reasoning Benchmark Generated with Knowledge Guidance ACL 2025

Lost in Transcription, Found in Distribution Shift: Demystifying Hallucination in Speech Foundation Models ACL 2025

7 Points to Tsinghua but 10 Points to ? Assessing Large Language Models in Agentic Multilingual National Bias ACL 2025

Bridging the Gap Between Ideal and Real-world Evaluation: Benchmarking AI-Generated Image Detection in Challenging Scenarios ICCV 2025

T2V-CompBench: A Comprehensive Benchmark for Compositional Text-to-video Generation CVPR 2025

Rethinking Prompt-based Debiasing in Large Language Model ACL 2025

Image Generation Diversity Issues and How to Tame Them CVPR 2025

SubLIME: Subset Selection via Rank Correlation Prediction for Data-Efficient LLM Evaluation ACL 2025

VideoAutoArena: An Automated Arena for Evaluating Large Multimodal Models in Video Analysis through User Simulation CVPR 2025

Analyzing Interview Questions via Bloom’s Taxonomy to Enhance the Design Thinking Process ACL 2025

K-Sort Arena: Efficient and Reliable Benchmarking for Generative Models via K-wise Human Preferences CVPR 2025

Judging It, Washing It: Scoring and Greenwashing Corporate Climate Disclosures using Large Language Models ACL 2025

SVLTA: Benchmarking Vision-Language Temporal Alignment via Synthetic Video Situation CVPR 2025

WinoWhat: A Parallel Corpus of Paraphrased WinoGrande Sentences with Common Sense Categorization ACL 2025

Toward Reliable Scientific Hypothesis Generation: Evaluating Truthfulness and Hallucination in Large Language Models IJCAI 2025

Contamination Budget: Trade-offs Between Breadth, Depth and Difficulty IJCAI 2025

Game Theory Meets Large Language Models: A Systematic Survey IJCAI 2025

Large Language Models Fall Short: Understanding Complex Relationships in Detective Narratives ACL 2024

“My Answer is C”: First-Token Probabilities Do Not Match Text Answers in Instruction-Tuned Language Models ACL 2024

The State of Relation Extraction Data Quality: Is Bigger Always Better? ACL 2024

Introducing GenCeption for Multimodal LLM Benchmarking: You May Bypass Annotations NAACL 2024

Holistic Evaluation of Large Language Models: Assessing Robustness, Accuracy, and Toxicity for Real-World Applications NAACL 2024

HANS, are you clever? Clever Hans Effect Analysis of Neural Systems NAACL 2024