Evaluation

1654 directly classified papers

Papers

On the Content Bias in Fréchet Video Distance CVPR 2024

On the Faithfulness of Vision Transformer Explanations CVPR 2024

A Toolbox for Modelling Engagement with Educational Videos AAAI 2024

Better than Random: Reliable NLG Human Evaluation with Constrained Active Sampling AAAI 2024

LatestEval: Addressing Data Contamination in Language Model Evaluation through Dynamic and Time-Sensitive Test Construction AAAI 2024

A Call to Reflect on Evaluation Practices for Age Estimation: Comparative Analysis of the State-of-the-Art and a Unified Benchmark CVPR 2024

The Mirrored Influence Hypothesis: Efficient Data Influence Estimation by Harnessing Forward Passes CVPR 2024

Can Biases in ImageNet Models Explain Generalization? CVPR 2024

CausalBench: A Comprehensive Benchmark for Evaluating Causal Reasoning Capabilities of Large Language Models ACL 2024

Measuring the Inconsistency of Large Language Models in Preferential Ranking ACL 2024

Beyond Probabilities: Unveiling the Misalignment in Evaluating Large Language Models ACL 2024

PUB: A Pragmatics Understanding Benchmark for Assessing LLMs’ Pragmatics Capabilities ACL 2024

Generalization or Memorization: Data Contamination and Trustworthy Evaluation for Large Language Models ACL 2024

Towards Multiple References Era – Addressing Data Leakage and Limited Reference Diversity in Machine Translation Evaluation ACL 2024

CMoralEval: A Moral Evaluation Benchmark for Chinese Large Language Models ACL 2024

Data Contamination Calibration for Black-box LLMs ACL 2024

ACUEval: Fine-grained Hallucination Evaluation and Correction for Abstractive Summarization ACL 2024

Are LLMs Capable of Data-based Statistical and Causal Reasoning? Benchmarking Advanced Quantitative Reasoning with Data ACL 2024

TaskBench: Benchmarking Large Language Models for Task Automation NIPS 2024

Automating Dataset Updates Towards Reliable and Timely Evaluation of Large Language Models NIPS 2024

Towards Human-AI Complementarity with Prediction Sets NIPS 2024

Questioning the Survey Responses of Large Language Models NIPS 2024

Rethinking LLM Memorization through the Lens of Adversarial Compression NIPS 2024

Conformalized Multiple Testing after Data-dependent Selection NIPS 2024

An engine not a camera: Measuring performative power of online search NIPS 2024