Corpus-Dependent Subcharacter Encoding via HMM-Guided Code Assignment

Tatsuya Hiraoka

ACL 2026

Abstract

We propose a corpus-dependent alternative to byte encoding that learns fixed-length atomic codes for characters directly from text, which we refer to as Latom (Learned Atom-based Encoding). We instantiate this framework by training an HMM on N-repeated character sequences to estimate "atom" posteriors, followed by a Hungarian assignment that yields a globally optimal one-to-one character-code mapping. Across 14 languages, the encodings improve intrinsic metrics, including token counts after subword tokenization and bigram perplexity, when the code length is chosen appropriately. On Amazon Reviews in six languages, Latom improves text classification accuracy and reduces decoding errors in language model generation. Overall, these results demonstrate that character encodings can be learned from corpus statistics while remaining reversible and compatible with standard tokenization pipelines.
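The assignment and reversibility steps described in the abstract can be illustrated with a minimal sketch. Everything concrete here is an assumption for illustration only: the two-atom alphabet, the code length of 2, the hand-written posterior table (standing in for HMM output, which is not trained here), the log-posterior scoring rule, and all function names. The one-to-one mapping is found by brute-force search over permutations, which reaches the same global optimum as a Hungarian assignment on this toy input but is not the Hungarian algorithm itself and would not scale to real alphabets.

```python
from itertools import permutations, product
from math import log

# Hypothetical toy setup: 2 atoms, fixed code length 2, three characters.
ATOMS = "01"
CODE_LEN = 2
CODES = ["".join(p) for p in product(ATOMS, repeat=CODE_LEN)]  # 00, 01, 10, 11

# Made-up per-position atom posteriors standing in for HMM estimates:
# POSTERIORS[ch][i][atom] = P(atom at code position i | character ch)
POSTERIORS = {
    "a": [{"0": 0.9, "1": 0.1}, {"0": 0.9, "1": 0.1}],
    "b": [{"0": 0.6, "1": 0.4}, {"0": 0.55, "1": 0.45}],
    "c": [{"0": 0.2, "1": 0.8}, {"0": 0.7, "1": 0.3}],
}


def score(ch, code):
    """Log-posterior of a code for a character (an assumed scoring rule)."""
    return sum(log(POSTERIORS[ch][i][atom]) for i, atom in enumerate(code))


def assign_codes(chars):
    """Pick the one-to-one character-code mapping with the highest total score.

    Brute force over all injective assignments; a stand-in for the
    Hungarian algorithm that is only feasible for tiny inputs.
    """
    best = max(
        permutations(CODES, len(chars)),
        key=lambda codes: sum(score(c, code) for c, code in zip(chars, codes)),
    )
    return dict(zip(chars, best))


def encode(text, table):
    """Replace each character with its fixed-length atom code."""
    return "".join(table[ch] for ch in text)


def decode(atoms, table):
    """Invert encode() by reading the atom string in fixed-length chunks."""
    inverse = {code: ch for ch, code in table.items()}
    return "".join(
        inverse[atoms[i:i + CODE_LEN]] for i in range(0, len(atoms), CODE_LEN)
    )


TABLE = assign_codes(sorted(POSTERIORS))
```

In this toy table both "a" and "b" score highest on the code 00, so a greedy per-character choice would collide; the global search gives 00 to "a" and moves "b" to its second choice. Because every code has the same length and the mapping is one-to-one, `decode(encode(text, TABLE), TABLE)` returns the original text.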

Topics

Natural Language Processing > Resources & Methods > Text Representation
Natural Language Processing > Applications > Text Processing
Natural Language Processing > Resources & Methods > Pretraining

Keywords

text classification, hidden Markov model, subword tokenization, byte encoding, subcharacter encoding
