Flow-Based Page Unique Semantic Mapping Architecture for Document Visual Question Answering

Haosen Wang; Jing Xiao; Chaochao Du; Xiaowang Zhang; Zhiyong Feng

ACL 2026


Abstract

Document Visual Question Answering (DocVQA) aims to generate answers by jointly understanding the textual, layout, and visual elements of document images. Although end-to-end vision-based generative methods have reduced the dependency on OCR, they still struggle to localize evidence precisely when page semantics are complex and highly similar. Existing research also lacks an in-depth theoretical analysis of the question-driven semantic representation space, and therefore fails to address the problem of distinguishing semantically similar pages. To fill this theoretical gap, we propose and prove that, given a specific question, each page possesses a unique semantic representation, and that there exists a bijective mapping between the page and its unique semantics. Building on this theoretical foundation, we introduce the Flow-Based Page Unique Semantic Mapping Architecture (FUMA), which reformulates evidence localization from similarity-based retrieval into precise selection over unique semantics. FUMA employs fine-grained cross-modal attention to extract discriminative cues and uses flow-based reversible transformations with likelihood regularization to learn bijective mappings, ensuring that each page obtains a unique semantic representation. In addition, a multi-expert collaboration mechanism models the fine-grained multimodal information within each page in a complementary way, yielding robust answer generation. Experimental results demonstrate that FUMA significantly outperforms existing methods in both evidence localization and answer generation.
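
The abstract does not say which flow architecture FUMA uses. As a rough illustration of what a "flow-based reversible transformation with likelihood regularization" can look like, the sketch below assumes a RealNVP-style affine coupling layer and a standard normal base density. Both are assumptions, not details from the paper. The class `AffineCoupling`, the function `negative_log_likelihood`, the random weights, and the 8-dimensional stand-in page feature are all hypothetical.

```python
import math
import random


class AffineCoupling:
    """One affine coupling layer (assumed design, not FUMA's actual layer)."""

    def __init__(self, dim, seed=0):
        rng = random.Random(seed)
        self.half = dim // 2
        self.rest = dim - self.half
        # Hypothetical tiny linear maps producing the log-scale and the shift.
        self.w_s = [[rng.gauss(0, 0.1) for _ in range(self.rest)]
                    for _ in range(self.half)]
        self.w_t = [[rng.gauss(0, 0.1) for _ in range(self.rest)]
                    for _ in range(self.half)]

    def _params(self, x1):
        # Scale and shift depend only on the untouched half, which is what
        # makes the layer invertible in closed form.
        log_s = [math.tanh(sum(x1[i] * self.w_s[i][j] for i in range(self.half)))
                 for j in range(self.rest)]
        t = [sum(x1[i] * self.w_t[i][j] for i in range(self.half))
             for j in range(self.rest)]
        return log_s, t

    def forward(self, x):
        x1, x2 = x[:self.half], x[self.half:]
        log_s, t = self._params(x1)
        z2 = [v * math.exp(s) + b for v, s, b in zip(x2, log_s, t)]
        # log|det J| is the sum of the log-scales.
        return x1 + z2, sum(log_s)

    def inverse(self, z):
        z1, z2 = z[:self.half], z[self.half:]
        log_s, t = self._params(z1)
        x2 = [(v - b) * math.exp(-s) for v, s, b in zip(z2, log_s, t)]
        return z1 + x2


def negative_log_likelihood(z, log_det):
    """NLL under a standard normal base density via change of variables."""
    log_pz = -0.5 * sum(v * v for v in z) - 0.5 * len(z) * math.log(2 * math.pi)
    return -(log_pz + log_det)


rng = random.Random(1)
page = [rng.gauss(0, 1) for _ in range(8)]  # stand-in page feature vector
layer = AffineCoupling(dim=8)
z, log_det = layer.forward(page)
recovered = layer.inverse(z)
nll = negative_log_likelihood(z, log_det)
```

Because `inverse` undoes `forward` exactly (up to floating-point error), two different inputs can never map to the same output, which is the bijectivity property the abstract relies on. In a setup like this, the negative log-likelihood would be added to the training loss as the regularization term.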

Topics

Deep Learning > Architectures > Transformers
Natural Language Processing > Applications > Document Analysis
Computer Vision > Applications > Visual Question Answering

Keywords

semantic representation; cross-modal attention; document visual question answering; bijective mapping; flow-based reversible transformation

