Evaluation

1654 directly classified papers

Papers

Noise-Aware Evaluation of Object Detectors WACV 2025

Label Convergence: Defining an Upper Performance Bound in Object Recognition through Contradictory Annotations WACV 2025

T2V-CompBench: A Comprehensive Benchmark for Compositional Text-to-video Generation CVPR 2025

Image Generation Diversity Issues and How to Tame Them CVPR 2025

VideoAutoArena: An Automated Arena for Evaluating Large Multimodal Models in Video Analysis through User Simulation CVPR 2025

K-Sort Arena: Efficient and Reliable Benchmarking for Generative Models via K-wise Human Preferences CVPR 2025

SVLTA: Benchmarking Vision-Language Temporal Alignment via Synthetic Video Situation CVPR 2025

Image Over Text: Transforming Formula Recognition Evaluation with Character Detection Matching CVPR 2025

ETF: An Entity Tracing Framework for Hallucination Detection in Code Summaries ACL 2025

LOFT: Scalable and More Realistic Long-Context Evaluation NAACL 2025

Aligning Black-box Language Models with Human Judgments NAACL 2025

CCHall: A Novel Benchmark for Joint Cross-Lingual and Cross-Modal Hallucinations Detection in Large Language Models ACL 2025

What’s the most important value? INVP: INvestigating the Value Priorities of LLMs through Decision-making in Social Scenarios COLING 2025

Image Captioning Evaluation in the Age of Multimodal LLMs: Challenges and Future Perspectives IJCAI 2025

CaLQuest.PT: Towards the Collection and Evaluation of Natural Causal Ladder Questions in Portuguese for AI Agents COLING 2025

Benchmarking LLMs and LLM-based Agents in Practical Vulnerability Detection for Code Repositories ACL 2025

Are Language Models Agnostic to Linguistically Grounded Perturbations? A Case Study of Indic Languages NAACL 2025

Evaluating Evaluation Metrics for Ancient Chinese to English Machine Translation NAACL 2025

Towards Region-aware Bias Evaluation Metrics NAACL 2025

SeedBench: A Multi-task Benchmark for Evaluating Large Language Models in Seed Science ACL 2025

Why should only High-Resource-Languages have all the fun? Pivot Based Evaluation in Low Resource Setting COLING 2025

BasqBBQ: A QA Benchmark for Assessing Social Biases in LLMs for Basque, a Low-Resource Language COLING 2025

Does Training on Synthetic Data Make Models Less Robust? NAACL 2025

WiCkeD: A Simple Method to Make Multiple Choice Benchmarks More Challenging ACL 2025

Only a Little to the Left: A Theory-grounded Measure of Political Bias in Large Language Models ACL 2025