Shinji Watanabe
186 papers · 2013–2026 · 11 conferences · across top CS/AI conferences
Achievements
Keyword Pioneer
Taxonomy Completionist (39)
Renaissance Researcher (5)
Interdisciplinary Bridge
Hot Topic Early Bird
Keyword Trendsetter Combo (8)
Conference Loyalist (22)
Domain Dominant (51)
Dynamic Duo (35)
Triple Crown
Topic Pioneer
Deep Specialist (27)
Topic Evolution
Keyword Champion (6)
Grand Slam
Mega-Team (76)
Century Club (180)
Conference Pioneer
Unstoppable (10)
The Questioner (3)
Prolific Year (31)
Keyword Collector (199)
Trend Setter
Conferences
INTERSPEECH (120)
ACL (26)
NAACL (12)
EMNLP (6)
EACL (4)
ICLR (4)
ICML (4)
AAAI (3)
IJCNLP (3)
IJCAI (2)
NIPS (2)
Keywords
automatic speech recognition (52)
speech recognition (31)
self-supervised learning (22)
end-to-end speech recognition (21)
speech translation (21)
speech enhancement (16)
end-to-end model (16)
spoken language understanding (15)
connectionist temporal classification (12)
beam search (10)
attention mechanism (9)
end-to-end learning (9)
speech processing (9)
neural network (9)
speaker diarization (8)
speech separation (8)
speech synthesis (8)
language model (8)
data augmentation (7)
transfer learning (7)
Papers
PRiSM: Benchmarking Phone Realization in Speech Models
ACL 2026
Full-Duplex-Bench-v2: A Multi-Turn Evaluation Framework for Duplex Dialogue Systems with an Automated Examiner
ACL 2026
CSPB: Conversational Speech Processing Benchmark for Self-supervised Speech Models
EACL 2026
BSCodec: A Band-Split Neural Codec for High-Quality Universal Audio Reconstruction
EACL 2026
POWSM: A Phonetic Open Whisper-Style Speech Foundation Model
ACL 2026
Speech-Hands: A Self-Reflection Voice Agentic Approach to Speech Recognition and Audio Reasoning with Omni Perception
ACL 2026
Leveraging Allophony in Self-Supervised Speech Models for Atypical Pronunciation Assessment
NAACL 2025
SpeechIQ: Speech-Agentic Intelligence Quotient Across Cognitive Levels in Voice Understanding by Large Language Models
ACL 2025
Summarizing Speech: A Comprehensive Survey
EMNLP 2025
Context-aware Dynamic Pruning for Speech Foundation Models
ICLR 2025
Talking Turns: Benchmarking Audio Foundation Models on Turn-Taking Dynamics
ICLR 2025
Dynamic-SUPERB Phase-2: A Collaboratively Expanding Benchmark for Measuring the Capabilities of Spoken Language Models with 180 Tasks
ICLR 2025
OWLS: Scaling Laws for Multilingual Speech Recognition and Translation Models
ICML 2025
Enhancing Audiovisual Speech Recognition Through Bifocal Preference Optimization
AAAI 2025
ESPnet-SDS: Unified Toolkit and Demo for Spoken Dialogue Systems
NAACL 2025
VERSA: A Versatile Evaluation Toolkit for Speech, Audio, and Music
NAACL 2025
ESPnet-SpeechLM: An Open Speech Language Model Toolkit
NAACL 2025
VoiceTextBlender: Augmenting Large Language Models with Speech Capabilities via Single-Stage Joint Speech-Text Supervised Fine-Tuning
NAACL 2025
Wav2Gloss: Generating Interlinear Glossed Text from Speech
ACL 2024
Self-Supervised Speech Representations are More Phonetic than Semantic
INTERSPEECH 2024
Can you Remove the Downstream Model for Speaker Recognition with Self-Supervised Speech Features?
INTERSPEECH 2024
Multi-Convformer: Extending Conformer with Multiple Convolution Kernels
INTERSPEECH 2024
OWSM v3.1: Better and Faster Open Whisper-Style Speech Models based on E-Branchformer
INTERSPEECH 2024
Neural Blind Source Separation and Diarization for Distant Speech Recognition
INTERSPEECH 2024
Finding Task-specific Subnetworks in Multi-task Spoken Language Understanding Model
INTERSPEECH 2024
CMU's IWSLT 2024 Offline Speech Translation System: A Cascaded Approach For Long-Form Robustness
ACL 2024
Convolution-Augmented Parameter-Efficient Fine-Tuning for Speech Recognition
INTERSPEECH 2024
AudioGPT: Understanding and Generating Speech, Music, Sound, and Talking Head
AAAI 2024
ML-SUPERB 2.0: Benchmarking Multilingual Speech Models Across Modeling Constraints, Languages, and Datasets
INTERSPEECH 2024
URGENT Challenge: Universality, Robustness, and Generalizability For Speech Enhancement
INTERSPEECH 2024
Singing Voice Data Scaling-up: An Introduction to ACE-Opencpop and ACE-KiSing
INTERSPEECH 2024
The Interspeech 2024 Challenge on Speech Processing Using Discrete Units
INTERSPEECH 2024
Towards Robust Speech Representation Learning for Thousands of Languages
EMNLP 2024
FastAdaSP: Multitask-Adapted Efficient Inference for Large Speech Language Model
EMNLP 2024
Findings of the IWSLT 2024 Evaluation Campaign
ACL 2024
Beyond Performance Plateaus: A Comprehensive Study on Scalability in Speech Enhancement
INTERSPEECH 2024
Cross-Talk Reduction
IJCAI 2024
SpeechBERTScore: Reference-Aware Automatic Evaluation of Speech Generation Leveraging NLP Evaluation Metrics
INTERSPEECH 2024
UniverSLU: Universal Spoken Language Understanding for Diverse Tasks with Natural Language Instructions
NAACL 2024
EARS: An Anechoic Fullband Speech Dataset Benchmarked for Speech Enhancement and Dereverberation
INTERSPEECH 2024
MMM: Multi-Layer Multi-Residual Multi-Stream Discrete Speech Representation from Self-supervised Learning Model
INTERSPEECH 2024
CMU's IWSLT 2024 Simultaneous Speech Translation System
ACL 2024
Decoder-only Architecture for Streaming End-to-end Speech Recognition
INTERSPEECH 2024
ESPnet-SPK: full pipeline speaker embedding toolkit with reproducible recipes, self-supervised front-ends, and off-the-shelf models
INTERSPEECH 2024
DiscreteSLU: A Large Language Model with Self-Supervised Discrete Speech Units for Spoken Language Understanding
INTERSPEECH 2024
EFFUSE: Efficient Self-Supervised Feature Fusion for E2E ASR in Low Resource and Multilingual Scenarios
INTERSPEECH 2024
On the Effects of Heterogeneous Data Sources on Speech-to-Text Foundation Models
INTERSPEECH 2024
Contextualized End-to-end Automatic Speech Recognition with Intermediate Biasing Loss
INTERSPEECH 2024
To what extent can ASV systems naturally defend against spoofing attacks?
INTERSPEECH 2024
Rapid Language Adaptation for Multilingual E2E Speech Recognition Using Encoder Prompting
INTERSPEECH 2024
Self-training ASR Guided by Unsupervised ASR Teacher
INTERSPEECH 2024
On the Evaluation of Speech Foundation Models for Spoken Language Understanding
ACL 2024
OWSM-CTC: An Open Encoder-Only Speech Foundation Model for Speech Recognition, Translation, and Language Identification
ACL 2024
SigMoreFun Submission to the SIGMORPHON Shared Task on Interlinear Glossing
ACL 2023
DPHuBERT: Joint Distillation and Pruning of Self-Supervised Speech Models
INTERSPEECH 2023
Prompting the Hidden Talent of Web-Scale Speech Models for Zero-Shot Task Generalization
INTERSPEECH 2023
Tensor decomposition for minimization of E2E SLU model toward on-device processing
INTERSPEECH 2023
Integrating Pretrained ASR and LM to Perform Sequence Generation for Spoken Language Understanding
INTERSPEECH 2023
ML-SUPERB: Multilingual Speech Universal PERformance Benchmark
INTERSPEECH 2023
Integration of Frame- and Label-synchronous Beam Search for Streaming Encoder-decoder Speech Recognition
INTERSPEECH 2023
Exploration of Efficient End-to-End ASR using Discretized Input from Self-Supervised Learning
INTERSPEECH 2023
BASS: Block-wise Adaptation for Speech Summarization
INTERSPEECH 2023
A New Benchmark of Aphasia Speech Recognition and Detection Based on E-Branchformer and Multi-task Learning
INTERSPEECH 2023
A Comparative Study on E-Branchformer vs Conformer in Speech Recognition, Translation, and Understanding Tasks
INTERSPEECH 2023
Exploration on HuBERT with Multiple Resolution
INTERSPEECH 2023
CTC Alignments Improve Autoregressive Translation
EACL 2023
Bayes Risk CTC: Controllable CTC Alignment in Sequence-to-Sequence Tasks
ICLR 2023
Efficient Sequence Transduction by Jointly Predicting Tokens and Durations
ICML 2023
Learning to Speak from Text: Zero-Shot Multilingual Text-to-Speech with Unsupervised Text Pretraining
IJCAI 2023
Deep Speech Synthesis from MRI-Based Articulatory Representations
INTERSPEECH 2023
Bayes Risk Transducer: Transducer with Controllable Alignment Prediction
INTERSPEECH 2023
Time-synchronous one-pass Beam Search for Parallel Online and Offline Transducers with Dynamic Block Training
INTERSPEECH 2023
Reducing Barriers to Self-Supervised Learning: HuBERT Pre-training with Academic Compute
INTERSPEECH 2023
Incremental Blockwise Beam Search for Simultaneous Speech Translation with Controllable Quality-Latency Tradeoff
INTERSPEECH 2023
UNSSOR: Unsupervised Neural Speech Separation by Leveraging Over-determined Training Mixtures
NIPS 2023
A Vector Quantized Approach for Text to Speech Synthesis on Real-World Spontaneous Speech
AAAI 2023
SLUE Phase-2: A Benchmark Suite of Diverse Spoken Language Understanding Tasks
ACL 2023
4D ASR: Joint modeling of CTC, Attention, Transducer, and Mask-Predict decoders
INTERSPEECH 2023
UnitY: Two-pass Direct Speech-to-speech Translation with Discrete Units
ACL 2023
ESPnet-ST-v2: Multipurpose Spoken Language Translation Toolkit
ACL 2023
Findings of the IWSLT 2023 Evaluation Campaign
ACL 2023
CMU's IWSLT 2023 Simultaneous Speech Translation System
ACL 2023
Improving Speech Enhancement through Fine-Grained Speech Characteristics
INTERSPEECH 2022
TriniTTS: Pitch-controllable End-to-end TTS without External Aligner
INTERSPEECH 2022
SingAug: Data Augmentation for Singing Voice Synthesis with Cycle-consistent Training Strategy
INTERSPEECH 2022
Residual Language Model for End-to-end Speech Recognition
INTERSPEECH 2022
Deep Speech Synthesis from Articulatory Representations
INTERSPEECH 2022
Attention Weight Smoothing Using Prior Distributions for Transformer-Based End-to-End ASR
INTERSPEECH 2022
SUPERB-SG: Enhanced Speech processing Universal PERformance Benchmark for Semantic and Generative Capabilities
ACL 2022
Zero-shot Learning for Grapheme to Phoneme Conversion with Language Ensemble
ACL 2022
Findings of the IWSLT 2022 Evaluation Campaign
ACL 2022
CMU's IWSLT 2022 Dialect Speech Translation System
ACL 2022
Audio-Visual Wake Word Spotting in MISP2021 Challenge: Dataset Release and Deep Analysis
INTERSPEECH 2022
VQ-T: RNN Transducers using Vector-Quantized Prediction Network States
INTERSPEECH 2022
Blockwise Streaming Transformer for Spoken Language Understanding and Simultaneous Speech Translation
INTERSPEECH 2022
Audio-Visual Speech Recognition in MISP2021 Challenge: Dataset Release and Deep Analysis
INTERSPEECH 2022
Minimum latency training of sequence transducers for streaming end-to-end speech recognition
INTERSPEECH 2022
Updating Only Encoders Prevents Catastrophic Forgetting of End-to-End ASR Models
INTERSPEECH 2022
Online Continual Learning of End-to-End Speech Recognition Models
INTERSPEECH 2022
End-to-End Integration of Speech Recognition, Speech Enhancement, and Self-Supervised Learning Representation
INTERSPEECH 2022
Self-supervised Representation Learning for Speech Processing
NAACL 2022
Branchformer: Parallel MLP-Attention Architectures to Capture Local and Global Context for Speech Recognition and Understanding
ICML 2022
BERT Meets CTC: New Formulation of End-to-End Speech Recognition with Pre-trained Masked Language Model
EMNLP 2022
Token-level Sequence Labeling for Spoken Language Understanding using Compositional End-to-End Models
EMNLP 2022
ESPnet-SE++: Speech Enhancement for Robust Speech Recognition, Translation, and Understanding
INTERSPEECH 2022
Better Intermediates Improve CTC Inference
INTERSPEECH 2022
ASR2K: Speech Recognition for Around 2000 Languages without Audio
INTERSPEECH 2022
Streaming Automatic Speech Recognition with Re-blocking Processing Based on Integrated Voice Activity Detection
INTERSPEECH 2022
Memory-Efficient Training of RNN-Transducer with Sampled Softmax
INTERSPEECH 2022
Muskits: an End-to-end Music Processing Toolkit for Singing Voice Synthesis
INTERSPEECH 2022
When Is TTS Augmentation Through a Pivot Language Useful?
INTERSPEECH 2022
Combining Spectral and Self-Supervised Features for Low Resource Speech Recognition and Translation
INTERSPEECH 2022
Two-Pass Low Latency End-to-End Spoken Language Understanding
INTERSPEECH 2022
Continuous Speech Separation Using Speaker Inventory for Long Recording
INTERSPEECH 2021
ESPnet-ST IWSLT 2021 Offline Speech Translation System
ACL 2021
Self-Guided Curriculum Learning for Neural Machine Translation
ACL 2021
Leveraging End-to-End ASR for Endangered Language Documentation: An Empirical Study on Yolóxochitl Mixtec
EACL 2021
ESPnet-ST IWSLT 2021 Offline Speech Translation System
IJCNLP 2021
Self-Guided Curriculum Learning for Neural Machine Translation
IJCNLP 2021
Data Augmentation Methods for End-to-End Speech Recognition on Distant-Talk Scenarios
INTERSPEECH 2021
Acoustic Event Detection with Classifier Chains
INTERSPEECH 2021
SUPERB: Speech Processing Universal PERformance Benchmark
INTERSPEECH 2021
Rethinking End-to-End Evaluation of Decomposable Tasks: A Case Study on Spoken Language Understanding
INTERSPEECH 2021
SPGISpeech: 5,000 Hours of Transcribed Financial Audio for Fully Formatted End-to-End Speech Recognition
INTERSPEECH 2021
Auxiliary Loss Function for Target Speech Extraction and Recognition with Weak Supervision Based on Speaker Characteristics
INTERSPEECH 2021
Speech Representation Learning Combining Conformer CPC with Deep Cluster for the ZeroSpeech Challenge 2021
INTERSPEECH 2021
Multi-Mode Transformer Transducer with Stochastic Future Context
INTERSPEECH 2021
Differentiable Allophone Graphs for Language-Universal Speech Recognition
INTERSPEECH 2021
Semi-Supervised Training with Pseudo-Labeling for End-To-End Neural Diarization
INTERSPEECH 2021
Online Streaming End-to-End Neural Diarization Handling Overlapping Speech and Flexible Numbers of Speakers
INTERSPEECH 2021
Leveraging Pre-Trained Language Model for Speech Sentiment Analysis
INTERSPEECH 2021
Speaker Verification-Based Evaluation of Single-Channel Speech Separation
INTERSPEECH 2021
Target-Speaker Voice Activity Detection with Improved i-Vector Estimation for Unknown Number of Speaker
INTERSPEECH 2021
GigaSpeech: An Evolving, Multi-Domain ASR Corpus with 10,000 Hours of Transcribed Audio
INTERSPEECH 2021
Multi-Speaker ASR Combining Non-Autoregressive Conformer CTC and Conditional Speaker Chain
INTERSPEECH 2021
Toward Streaming ASR with Non-Autoregressive Insertion-Based Model
INTERSPEECH 2021
Layer Pruning on Demand with Intermediate CTC
INTERSPEECH 2021
Streaming End-to-End ASR Based on Blockwise Non-Autoregressive Models
INTERSPEECH 2021
End-to-end ASR to jointly predict transcriptions and linguistic annotations
NAACL 2021
Source and Target Bidirectional Knowledge Distillation for End-to-end Speech Translation
NAACL 2021
Searchable Hidden Intermediates for End-to-End Models of Decomposable Sequence Tasks
NAACL 2021
Highland Puebla Nahuatl Speech Translation Corpus for Endangered Language Documentation
NAACL 2021
Learning Speaker Embedding from Text-to-Speech
INTERSPEECH 2020
End-to-End ASR with Adaptive Span Self-Attention
INTERSPEECH 2020
Mask CTC: Non-Autoregressive End-to-End ASR with CTC and Mask Predict
INTERSPEECH 2020
Speaker-Conditional Chain Model for Speech Separation and Extraction
INTERSPEECH 2020
End-to-End Far-Field Speech Recognition with Unified Dereverberation and Beamforming
INTERSPEECH 2020
End-to-End Speaker Diarization for an Unknown Number of Speakers with Encoder-Decoder Based Attractors
INTERSPEECH 2020
Insertion-Based Modeling for End-to-End Automatic Speech Recognition
INTERSPEECH 2020
Sequence to Multi-Sequence Learning via Conditional Chain Mapping for Mixture Signals
NIPS 2020
ESPnet-ST: All-in-One Speech Translation Toolkit
ACL 2020
Auxiliary Interference Speaker Loss for Target-Speaker Speech Recognition
INTERSPEECH 2019
Pre-Trained Text Embeddings for Enhanced Text-to-Speech Synthesis
INTERSPEECH 2019
Pretraining by Backtranslation for End-to-End ASR in Low-Resource Settings
INTERSPEECH 2019
Study of the Performance of Automatic Speech Recognition Systems in Speakers with Parkinson's Disease
INTERSPEECH 2019
End-to-End Neural Speaker Diarization with Permutation-Free Objectives
INTERSPEECH 2019
Vectorized Beam Search for CTC-Attention-Based Speech Recognition
INTERSPEECH 2019
Semi-Supervised Sequence-to-Sequence ASR Using Unpaired Speech and Text
INTERSPEECH 2019
End-to-End Multilingual Multi-Speaker Speech Recognition
INTERSPEECH 2019
Massively Multilingual Adversarial Speech Recognition
NAACL 2019
Analysis of Multilingual Sequence-to-Sequence Speech Recognition Systems
INTERSPEECH 2019
Speaker Recognition Benchmark Using the CHiME-5 Corpus
INTERSPEECH 2019
Improving Transformer-Based End-to-End Speech Recognition with Connectionist Temporal Classification and Language Model Integration
INTERSPEECH 2019
End-to-End SpeakerBeam for Single Channel Target Speech Recognition
INTERSPEECH 2019
The JHU/KyotoU Speech Translation System for IWSLT 2018
EMNLP 2018
Student-Teacher Learning for BLSTM Mask-based Speech Enhancement
INTERSPEECH 2018
Diarization is Hard: Some Experiences and Lessons Learned for the JHU Team in the Inaugural DIHARD Challenge
INTERSPEECH 2018
Auxiliary Feature Based Adaptation of End-to-end ASR Systems
INTERSPEECH 2018
Multi-Modal Data Augmentation for End-to-end ASR
INTERSPEECH 2018
ESPnet: End-to-End Speech Processing Toolkit
INTERSPEECH 2018
Effectiveness of Single-Channel BLSTM Enhancement for Language Identification
INTERSPEECH 2018
Building State-of-the-art Distant Speech Recognition Using the CHiME-4 Challenge with a Setup of Speech Enhancement Baseline
INTERSPEECH 2018
The Fifth 'CHiME' Speech Separation and Recognition Challenge: Dataset, Task and Baselines
INTERSPEECH 2018
Multi-Head Decoder for End-to-End Speech Recognition
INTERSPEECH 2018
Semi-Supervised End-to-End Speech Recognition
INTERSPEECH 2018
A Purely End-to-End System for Multi-speaker Speech Recognition
ACL 2018
Joint CTC/attention decoding for end-to-end speech recognition
ACL 2017
Multichannel End-to-end Speech Recognition
ICML 2017
Advances in Joint CTC-Attention Based End-to-End Speech Recognition with a Deep CNN Encoder and RNN-LM
INTERSPEECH 2017
Coupled Initialization of Multi-Channel Non-Negative Matrix Factorization Based on Spatial and Spectral Information
INTERSPEECH 2017
Semi-Supervised Learning of a Pronunciation Dictionary from Disjoint Phonemic Transcripts and Text
INTERSPEECH 2017
Context-Sensitive and Role-Dependent Spoken Language Understanding Using Bidirectional and Attention LSTMs
INTERSPEECH 2016
Data Selection by Sequence Summarizing Neural Network in Mismatch Condition Training
INTERSPEECH 2016
Improved MVDR Beamforming Using Single-Channel Mask Prediction Networks
INTERSPEECH 2016
Single-Channel Multi-Speaker Separation Using Deep Clustering
INTERSPEECH 2016
Statistical Dialogue Management using Intention Dependency Graph
IJCNLP 2013