model evaluation

442 papers

Co-occurring keywords

large language model (12755); benchmark evaluation (1539); text classification (6776); language model (4573); natural language processing (2027); evaluation benchmark (250); natural language inference (1278); multimodal learning (4622); bias detection (419); transfer learning (5442)

Papers

Beyond Accuracy: Behavioral Testing of NLP Models with CheckList (Extended Abstract) IJCAI 2021

ExplainaBoard: An Explainable Leaderboard for NLP ACL 2021

A Case Study of Efficacy and Challenges in Practical Human-in-Loop Evaluation of NLP Systems Using Checklist EACL 2021

Active Assessment of Prediction Services as Accuracy Surface Over Attribute Combinations NIPS 2021

Perception Matters: Detecting Perception Failures of VQA Models Using Metamorphic Testing CVPR 2021

Evaluation Examples are not Equally Informative: How should that change NLP Leaderboards? ACL 2021

Polyjuice: Generating Counterfactuals for Explaining, Evaluating, and Improving Models IJCNLP 2021

All That’s ‘Human’ Is Not Gold: Evaluating Human Evaluation of Generated Text IJCNLP 2021

Anatomy of OntoGUM—Adapting GUM to the OntoNotes Scheme to Evaluate Robustness of SOTA Coreference Algorithms EMNLP 2021

DynaSent: A Dynamic Benchmark for Sentiment Analysis IJCNLP 2021

HateCheck: Functional Tests for Hate Speech Detection Models ACL 2021

Guiding Principles for Participatory Design-inspired Natural Language Processing ACL 2021

Perceptual Score: What Data Modalities Does Your Model Perceive? NIPS 2021

How Robust are Model Rankings: A Leaderboard Customization Approach for Equitable Evaluation AAAI 2021

Comparing Test Sets with Item Response Theory IJCNLP 2021

Active Testing: Sample-Efficient Model Evaluation ICML 2021

Dynaboard: An Evaluation-As-A-Service Platform for Holistic Next-Generation Benchmarking NIPS 2021

Utility is in the Eye of the User: A Critique of NLP Leaderboards EMNLP 2020

Interpretable Multi-dataset Evaluation for Named Entity Recognition EMNLP 2020

Let’s Stop Incorrect Comparisons in End-to-end Relation Extraction! EMNLP 2020

RethinkCWS: Is Chinese Word Segmentation a Solved Task? EMNLP 2020

Adversarial NLI: A New Benchmark for Natural Language Understanding ACL 2020

Does my multimodal model learn cross-modal interactions? It’s harder to tell than you might think! EMNLP 2020

How Effectively Can Machines Defend Against Machine-Generated Fake News? An Empirical Study EMNLP 2020

Evaluating Models’ Local Decision Boundaries via Contrast Sets EMNLP 2020