Repeated Sequences Reveal Gaps between Large Language Models and Natural Language

Kumiko Tanaka-Ishii

ACL 2026

Abstract

Evaluating whether large language models (LLMs) capture the structure of natural language beyond local fluency remains an open challenge. Existing evaluation methods, largely based on task performance or short-context behavior, provide limited insight into the long-range statistical organization of generated text. We propose a complementary evaluation framework based on repeated subsequences. By analyzing their distribution across scales and relating it to higher-order Rényi entropies, we probe how texts reuse previously established structure under finite-length conditions. Experiments on human-written texts and length-matched GPT-generated texts show that, while power-law models can describe restricted ranges of block length, the observed entropy growth is often equally or better characterized by logarithmic–power forms. Across datasets, natural language exhibits stable entropy-growth patterns over accessible ranges, with consistent average behavior despite variability across individual texts. In contrast, GPT-generated texts show systematic and statistically significant shifts in estimated exponents with model size. These results demonstrate that repeated-subsequence entropy provides a quantitative structural diagnostic that reveals systematic differences in long-range organization, distinguishing natural language from state-of-the-art LLM outputs beyond surface-level fluency.
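
To make the link between repeated subsequences and higher-order Rényi entropy concrete, the following is a minimal sketch, not the paper's estimator. It assumes the order-2 (collision) case, where the entropy of length-n blocks is the negative logarithm of the probability that two blocks drawn from the text coincide, and it uses a simple plug-in estimate over overlapping blocks. The function names `block_counts`, `renyi2_block_entropy`, and `repeated_block_fraction` are hypothetical and chosen only for illustration.

```python
from collections import Counter
import math


def block_counts(tokens, n):
    """Count every overlapping length-n block (n-gram) in the token sequence."""
    if n < 1 or n > len(tokens):
        raise ValueError("block length n must satisfy 1 <= n <= len(tokens)")
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def renyi2_block_entropy(tokens, n):
    """Plug-in estimate (in bits) of the order-2 Renyi entropy of length-n blocks.

    Assumption: H_2 = -log2(sum_b p_b^2), with p_b the empirical block frequency.
    """
    counts = block_counts(tokens, n)
    total = sum(counts.values())
    # Collision probability: chance that two blocks drawn with replacement match.
    collision = sum(c * c for c in counts.values()) / (total * total)
    return -math.log2(collision)


def repeated_block_fraction(tokens, n):
    """Fraction of length-n block occurrences whose block appears more than once."""
    counts = block_counts(tokens, n)
    total = sum(counts.values())
    return sum(c for c in counts.values() if c > 1) / total


if __name__ == "__main__":
    text = "the cat sat on the mat and the cat sat on the rug".split()
    for n in (1, 2, 3):
        print(n, renyi2_block_entropy(text, n), repeated_block_fraction(text, n))
```

Tracking how such an estimate grows with n is one way to compare a power-law fit against a logarithmic–power fit. The plug-in estimate is biased for short texts, so any real comparison would need length-matched inputs, as the abstract describes.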

Topics

Artificial Intelligence > Core AI > Large Language Models
Artificial Intelligence > Core AI > Evaluation
Natural Language Processing > Applications > Evaluation

Keywords

text generation, natural language, Rényi entropy, repeated subsequence, long-range organization
