Predicting the Year of Total Knee Replacement: A Transformer-Based Multimodal Approach
Abstract
Accurate prediction of the year of total knee replacement (TKR) is challenging due to the complex interplay of factors influencing the surgical decision. Current deep learning models often rely on single-modality data, limiting their predictive power. Multimodal approaches integrating imaging and patient data offer the potential to improve predictions and support clinical decisions. This study presents an end-to-end trained, transformer-based multimodal model that integrates MR imaging with tabular data, including clinical variables and image readings, to predict the year of TKR for each subject. Our model leverages cross-modal attention to fuse features from an image encoder with a self-supervised pretrained tabular encoder, achieving the highest accuracy of 63.4% among tested models. We evaluated its performance against three unimodal models and four multimodal fusion strategies, including simple concatenation, DAFT, and multimodal interaction. The results demonstrate that our model’s cross-modal interaction approach with pretrained TabNet not only outperformed all unimodal models but also showed improvements over other multimodal fusion techniques, highlighting the effectiveness of cross-modal attention fusion for integrating complex data modalities in TKR year prediction tasks. Source code is available at https://github.com/denizlab/2025_MIDL_time2TKR.
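To make the idea of cross-modal attention fusion concrete, the following is a minimal sketch of scaled dot-product attention in which a single tabular embedding queries a set of image tokens. It is not the model described above: the function names, the toy values, the choice of the tabular embedding as the query, and the omission of learned query/key/value projections are all assumptions made for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_modal_attention(tab_query, img_tokens):
    """Attend from one tabular embedding (query) over image tokens (keys/values).

    Learned Q/K/V projections are omitted for brevity (illustrative
    simplification). Returns (fused_vector, attention_weights).
    """
    d = len(tab_query)
    # Scaled dot product between the tabular query and each image token.
    scores = [
        sum(q * k for q, k in zip(tab_query, tok)) / math.sqrt(d)
        for tok in img_tokens
    ]
    weights = softmax(scores)
    # Weighted sum of image tokens gives the fused representation.
    fused = [
        sum(w * tok[j] for w, tok in zip(weights, img_tokens))
        for j in range(d)
    ]
    return fused, weights


# Toy inputs (hypothetical values, not data from the study).
tab_query = [1.0, 0.0, 1.0, 0.0]
img_tokens = [
    [1.0, 0.0, 1.0, 0.0],
    [0.0, 1.0, 0.0, 1.0],
    [0.5, 0.5, 0.5, 0.5],
]
fused, weights = cross_modal_attention(tab_query, img_tokens)
```

In this sketch the image token most aligned with the tabular query receives the largest weight, so the fused vector is dominated by image features relevant to the patient data. A full model would typically add learned projections, multiple heads, and a classification head over the candidate TKR years.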