SHANKS: Simultaneous Hearing and Thinking for Spoken Language Models

Cheng-Han Chiang; Xiaofei Wang; Linjie Li; Chung-Ching Lin; Kevin Lin; Shujie LIU; Zhendong Wang; Zhengyuan Yang; Hung-yi Lee; Lijuan Wang

ACL 2026

Abstract

Current large language models (LLMs) and spoken language models (SLMs) begin thinking and taking actions only after the user has finished their turn. This prevents the model from interacting with the user during the user’s turn and can lead to high response latency when the model is thinking. To address this issue, we draw inspiration from the “think while listening” behavior of humans. In this paper, we propose SHANKS, a general inference framework that enables SLMs to generate unspoken chain-of-thought reasoning while listening to user input. SHANKS streams input speech in fixed-duration chunks and, as soon as a chunk is received, generates unspoken reasoning based on all previous speech and reasoning, while the user continues speaking. SHANKS uses unspoken reasoning to determine whether to interrupt the user and make tool calls to complete the task. We demonstrate that SHANKS enhances real-time user–SLM interaction in two scenarios: (1) SHANKS can listen to the user’s speech and interrupt when the user makes a mistake. (2) In a tool-augmented dialogue scenario, SHANKS can complete 56.9% of the tool calls before the user ends their turn. Overall, SHANKS is a step toward models that keep thinking throughout the conversation, not only after a turn ends. Demos can be found on the project page: https://d223302.github.io/SHANKS/.
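The chunked think-while-listening loop described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the names `ThinkingStep`, `ListenerState`, `think_while_listening`, and `toy_think` are hypothetical, chunks are represented as text strings instead of audio, and the `think` callable stands in for whatever SLM produces the unspoken reasoning.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterable, List, Optional


@dataclass
class ThinkingStep:
    """Output of one unspoken reasoning pass (hypothetical structure)."""
    reasoning: str
    interrupt: bool = False
    tool_call: Optional[str] = None


@dataclass
class ListenerState:
    """Everything heard and thought so far during the user's turn."""
    chunks: List[str] = field(default_factory=list)
    reasoning: List[str] = field(default_factory=list)
    tool_calls: List[str] = field(default_factory=list)
    interrupted_at: Optional[int] = None


def think_while_listening(
    chunks: Iterable[str],
    think: Callable[[List[str], List[str]], ThinkingStep],
) -> ListenerState:
    """Run one reasoning pass per incoming chunk while the user is still speaking."""
    state = ListenerState()
    for index, chunk in enumerate(chunks):
        state.chunks.append(chunk)
        # Reason over all speech heard so far plus all earlier reasoning.
        step = think(state.chunks, state.reasoning)
        state.reasoning.append(step.reasoning)
        if step.tool_call is not None:
            # A tool call can be issued before the user's turn ends.
            state.tool_calls.append(step.tool_call)
        if step.interrupt:
            # Stop listening and take the floor.
            state.interrupted_at = index
            break
    return state


def toy_think(chunks: List[str], reasoning: List[str]) -> ThinkingStep:
    """Stand-in for the model (an assumption): flags one hard-coded arithmetic slip."""
    heard = " ".join(chunks)
    if "2 + 2 = 5" in heard:
        return ThinkingStep("The user made an arithmetic mistake.", interrupt=True)
    return ThinkingStep(f"Heard {len(chunks)} chunk(s); nothing to act on yet.")


state = think_while_listening(
    ["first, 2 + 2", "= 5, and then", "we add 3"], toy_think
)
```

In this toy run the loop stops at the second chunk, because the mistake only becomes visible once the first two chunks are combined; the third chunk is never consumed. A real system would replace `toy_think` with a model call and would need to bound the reasoning time per chunk so that it keeps pace with the incoming audio.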

Topics

Artificial Intelligence > Core AI > Large Language Models; Artificial Intelligence > Core AI > Speech Processing; Deep Learning > Learning Types > Chain-of-Thought Reasoning

Keywords

chain-of-thought reasoning; real-time interaction; streaming inference; spoken language model; tool-augmented dialogue
