Corpus-Dependent Subcharacter Encoding via HMM-Guided Code Assignment
Abstract
We propose a corpus-dependent alternative to byte encoding that learns fixed-length atomic codes for characters directly from text, which we refer to as Latom (Learned Atom-based Encoding). We instantiate this framework by training an HMM on N-repeated character sequences to estimate "atom" posteriors, followed by a Hungarian assignment yielding a globally optimal one-to-one character-code mapping. Across 14 languages, the encodings improve intrinsic metrics, including token counts after subword tokenization and bigram perplexity, with appropriate code lengths. On Amazon Reviews in six languages, Latom improves text classification accuracy and reduces decoding errors in language model generation. Overall, these results demonstrate that character encodings can be learned from corpus statistics while remaining reversible and compatible with standard tokenization pipelines.
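The assignment step can be illustrated with a minimal sketch. It assumes a hypothetical table `posteriors[c][i][a]` giving the probability of atom `a` at position `i` for character `c`, and it scores a candidate code by the sum of log posteriors over positions. Both the table layout and the scoring rule are assumptions made for illustration, not the objective stated in this work, and the HMM training itself is not shown. The names `hungarian` and `assign_codes` are likewise illustrative.

```python
import math
from itertools import product


def hungarian(cost):
    """Minimum-cost assignment for an n x m cost matrix with n <= m.

    Standard O(n^2 * m) potentials-based Hungarian algorithm.
    Returns a list where entry r is the column assigned to row r.
    """
    n, m = len(cost), len(cost[0])
    inf = float("inf")
    u = [0.0] * (n + 1)      # row potentials (1-indexed)
    v = [0.0] * (m + 1)      # column potentials (1-indexed)
    match = [0] * (m + 1)    # match[j] = row matched to column j, 0 = free
    way = [0] * (m + 1)      # back-pointers along the augmenting path
    for i in range(1, n + 1):
        match[0] = i
        j0 = 0
        minv = [inf] * (m + 1)
        used = [False] * (m + 1)
        while True:
            used[j0] = True
            i0 = match[j0]
            delta, j1 = inf, 0
            for j in range(1, m + 1):
                if not used[j]:
                    cur = cost[i0 - 1][j - 1] - u[i0] - v[j]
                    if cur < minv[j]:
                        minv[j], way[j] = cur, j0
                    if minv[j] < delta:
                        delta, j1 = minv[j], j
            for j in range(m + 1):
                if used[j]:
                    u[match[j]] += delta
                    v[j] -= delta
                else:
                    minv[j] -= delta
            j0 = j1
            if match[j0] == 0:
                break
        # Flip the matching along the augmenting path.
        while j0:
            j1 = way[j0]
            match[j0] = match[j1]
            j0 = j1
    col_of_row = [0] * n
    for j in range(1, m + 1):
        if match[j]:
            col_of_row[match[j] - 1] = j - 1
    return col_of_row


def assign_codes(posteriors, num_atoms):
    """Map each character to a distinct fixed-length atom code.

    ASSUMPTION: the cost of giving character c the code (a_1..a_N) is the
    negative sum of log posteriors[c][i][a_i]; this is illustrative only.
    """
    chars = sorted(posteriors)
    code_len = len(posteriors[chars[0]])
    codes = list(product(range(num_atoms), repeat=code_len))
    if len(codes) < len(chars):
        raise ValueError("not enough codes for a one-to-one mapping")
    cost = [
        [-sum(math.log(posteriors[c][i][a] + 1e-12) for i, a in enumerate(code))
         for code in codes]
        for c in chars
    ]
    cols = hungarian(cost)
    return {c: codes[k] for c, k in zip(chars, cols)}


# Toy input: 3 characters, 2 atoms, code length 2 (values are made up).
toy_posteriors = {
    "a": [[0.9, 0.1], [0.8, 0.2]],
    "b": [[0.9, 0.1], [0.7, 0.3]],
    "c": [[0.2, 0.8], [0.4, 0.6]],
}
mapping = assign_codes(toy_posteriors, num_atoms=2)
print(mapping)
```

In this toy input, "a" and "b" both prefer the same code, so a greedy per-character choice would collide. Solving the assignment jointly keeps the mapping one-to-one, which is what makes the encoding reversible.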