PEAP: Proactive Embodied Action Sequence Planning with Joint Understanding of Vision and Audio Perception

Tianwei Lan; Jiaqi Wu; Zeming Liu; Zhaoxin Fan; Haifeng Wang; Yuhang Guo

ACL 2026

Abstract

Embodied action sequence planning concerns the ability of embodied agents to plan actions from environmental perception. This capability enables diverse intelligent assistance in real-world settings such as homes and offices. Existing embodied agents fall short on proactivity and on joint understanding of visual and audio information. To address these limitations, this study investigates whether embodied agents can proactively provide assistance, without explicit human instructions, by planning action sequences from a joint understanding of vision and audio perception. To this end, we propose PEAP, the first multimodal proactive embodied action sequence planning dataset. We evaluate multiple Large Language Models on the PEAP dataset. The results show that these models still exhibit significant deficiencies on this task, particularly a lack of accurate environmental perception. Ablation and replacement experiments further corroborate that joint understanding of multimodal information significantly improves the models’ performance on the proactive embodied action sequence planning task.

Authors

Tianwei Lan, Jiaqi Wu, Zeming Liu, Zhaoxin Fan, Haifeng Wang, Yuhang Guo

Topics

Artificial Intelligence > Core AI > Planning

Artificial Intelligence > Core AI > Robotics

Artificial Intelligence > Core AI > Multi-Modal Learning

Keywords

embodied agent; large language model; multimodal perception; action sequence planning; environmental perception

Related papers

No Reader Left Behind: Multi-Agent Summaries Everyone Can Understand 2026

One-step Nonautoregressive Natural Language Generation with Shortcut Flow Matching Models 2026

Optimizing Retrieval-Augmented Generation for E-Commerce How-To Assistance 2026

Make Mechanistic Interpretability Auditable: A Call to Develop Guidelines via Continuous Collaborative Reviewing 2026

MQM Re-Annotation: A Technique for Collaborative Evaluation of Machine Translation 2026