COLING 2020

Taxy.io@FinTOC-2020: Multilingual Document Structure Extraction using Transfer Learning

Abstract

In this paper we describe our system submitted to the FinTOC-2020 shared task on financial document structure extraction. We propose a two-step approach to identify titles in financial documents and to extract their table of contents (TOC). First, we identify text blocks as candidates for titles using unsupervised learning based on character-level information of each document. Then, we apply supervised learning on a self-constructed regression task to predict the depth of each text block in the document structure hierarchy using transfer learning combined with document features and layout features. It is noteworthy that our single multilingual model performs well on both tasks and on different languages, which indicates the usefulness of transfer learning for title detection and TOC generation. Moreover, our approach is independent of the presence of actual TOC pages in the documents. It is also one of the few submissions to the FinTOC-2020 shared task addressing both subtasks in both languages, English and French, with one single model.
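
To illustrate how a per-block depth prediction can be turned into a TOC, the following minimal Python sketch nests text blocks by their predicted depth. It is not the authors' implementation: the names `TocNode` and `build_toc`, the convention that a rounded depth of 0 marks a non-title block, and the `max_depth` cap are all assumptions made for illustration only.

```python
import math
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class TocNode:
    """One TOC entry (hypothetical structure, not from the paper)."""
    title: str
    depth: int
    children: List["TocNode"] = field(default_factory=list)


def build_toc(blocks: List[Tuple[str, float]], max_depth: int = 5) -> TocNode:
    """Nest text blocks into a tree using regression-style depth scores.

    `blocks` holds (text, predicted_depth) pairs in reading order.
    Assumption: scores that round to 0 mean "not a title" and are skipped.
    """
    root = TocNode(title="ROOT", depth=0)
    stack = [root]  # path from the root to the most recent title
    for text, score in blocks:
        depth = math.floor(score + 0.5)  # round half up to an integer level
        if depth < 1:
            continue  # treated as body text under the assumption above
        depth = min(depth, max_depth)
        # Climb back up until the top of the stack is a shallower title.
        while stack[-1].depth >= depth:
            stack.pop()
        node = TocNode(title=text, depth=depth)
        stack[-1].children.append(node)
        stack.append(node)
    return root


def print_toc(node: TocNode, indent: int = 0) -> None:
    """Print the tree with two spaces of indentation per level."""
    for child in node.children:
        print("  " * indent + child.title)
        print_toc(child, indent + 1)


# Made-up example scores, not results from the paper.
example_blocks = [
    ("Annual Report", 1.1),
    ("Some body text.", 0.2),
    ("Risk Factors", 1.9),
    ("Market Risk", 3.2),
    ("Fees", 2.0),
]
print_toc(build_toc(example_blocks))
```

Because the nesting depends only on the predicted depths and the reading order, a sketch like this needs no TOC page in the document, which matches the independence from TOC pages stated in the abstract.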
