Understanding Large Language Model Based Metrics for Text Summarization

Abhishek Pradhan; Ketan Todi

AACL 2023

Abstract

This paper compares the two most widely used techniques for evaluating generative tasks with large language models (LLMs), prompt-based evaluation and log-likelihood evaluation, as part of the Eval4NLP shared task. We focus on summarization and evaluate both small and large LLMs. We also study the impact of LLAMA and LLAMA 2 on summarization, using the same set of prompts and techniques, and we use the Eval4NLP dataset for the comparison. The study provides evidence of the advantages of prompt-based evaluation over log-likelihood-based evaluation, especially for large models and models with better reasoning power.
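
To make the two techniques concrete, the following is a minimal sketch and not the authors' implementation. It assumes a log-likelihood metric scores a summary by its average per-token log-probability given the source, and a prompt-based metric asks a model for a rating and parses the number from its reply. The names `toy_token_prob`, `build_prompt`, `parse_score`, the prompt wording, and the 1 to 5 scale are all hypothetical; a real setup would replace the toy probability function and the stubbed `llm` callable with an actual model.

```python
import math
import re
from typing import Callable, List, Optional


def toy_token_prob(source_tokens: List[str], token: str) -> float:
    # Hypothetical stand-in for an LLM's conditional token probability:
    # tokens that appear in the source are treated as more likely.
    return 0.5 if token in source_tokens else 0.05


def log_likelihood_score(
    source_tokens: List[str],
    summary_tokens: List[str],
    token_prob: Callable[[List[str], str], float] = toy_token_prob,
) -> float:
    # Average log-probability of the summary tokens given the source.
    if not summary_tokens:
        raise ValueError("summary must contain at least one token")
    total = sum(math.log(token_prob(source_tokens, t)) for t in summary_tokens)
    return total / len(summary_tokens)


def build_prompt(source: str, summary: str) -> str:
    # Hypothetical prompt wording; the paper's prompts are not reproduced here.
    return (
        "Rate the summary of the source text on a scale from 1 to 5.\n"
        f"Source: {source}\nSummary: {summary}\nScore:"
    )


def parse_score(response: str, low: float = 1, high: float = 5) -> Optional[float]:
    # Take the first number in the reply; reject it if it is out of range.
    match = re.search(r"\d+(?:\.\d+)?", response)
    if match is None:
        return None
    value = float(match.group())
    return value if low <= value <= high else None


def prompt_based_score(
    source: str, summary: str, llm: Callable[[str], str]
) -> Optional[float]:
    # `llm` is any callable that maps a prompt string to a text reply.
    return parse_score(llm(build_prompt(source, summary)))


source = "the cat sat on the mat".split()
faithful = log_likelihood_score(source, "cat sat".split())
off_topic = log_likelihood_score(source, "dogs bark".split())
stub_score = prompt_based_score("the cat sat", "cat sat", lambda p: "Score: 4")
```

In this toy setup the on-topic summary receives a higher average log-probability than the off-topic one, and the stubbed model reply is parsed into a numeric rating. Returning `None` for an unparseable reply keeps malformed model output visible instead of silently turning it into a score.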

Topics

Natural Language Processing > Generation > Summarization
Natural Language Processing > Resources & Methods > Large Language Models

Keywords

text summarization, summarization evaluation, prompt-based evaluation, log-likelihood evaluation, large language model
