Shinji Watanabe

186 papers · 2013–2026 · 11 conferences · across top CS/AI conferences

Achievements

🧭 Keyword Pioneer 🗺️ Taxonomy Completionist (39) 🌈 Renaissance Researcher (5) 🌉 Interdisciplinary Bridge 🐣 Hot Topic Early Bird 🌟 Keyword Trendsetter Combo (8) 🏠 Conference Loyalist (22) 👑 Domain Dominant (51) 🤝 Dynamic Duo (35) 👑 Triple Crown 🌱 Topic Pioneer 🔬 Deep Specialist (27) 🧬 Topic Evolution 🏆 Keyword Champion (6) 🏆 Grand Slam 👥 Mega-Team (76) 💎 Century Club (180) 🚀 Conference Pioneer 🔥 Unstoppable (10) ❓ The Questioner (3) ⚡ Prolific Year (31) 🗃️ Keyword Collector (199) 📈 Trend Setter

Conferences

INTERSPEECH (120) ACL (26) NAACL (12) EMNLP (6) EACL (4) ICLR (4) ICML (4) AAAI (3) IJCNLP (3) IJCAI (2) NIPS (2)

Top co-authors

Jiatong Shi (37) Brian Yan (27) Siddhant Arora (26) Yifan Peng (23) Xuankai Chang (23) William Chen (19) Jinchuan Tian (15) Karen Livescu (13) Siddharth Dalmia (12) Emiru Tsunoo (10)

Research topics

Speech & Audio (1) Processing (1)

Keywords

automatic speech recognition (52) speech recognition (31) self-supervised learning (22) end-to-end speech recognition (21) speech translation (21) speech enhancement (16) end-to-end model (16) spoken language understanding (15) connectionist temporal classification (12) beam search (10) attention mechanism (9) end-to-end learning (9) speech processing (9) neural network (9) speaker diarization (8) speech separation (8) speech synthesis (8) language model (8) data augmentation (7) transfer learning (7)

Papers

PRiSM: Benchmarking Phone Realization in Speech Models ACL 2026
Full-Duplex-Bench-v2: A Multi-Turn Evaluation Framework for Duplex Dialogue Systems with an Automated Examiner ACL 2026
CSPB: Conversational Speech Processing Benchmark for Self-supervised Speech Models EACL 2026
BSCodec: A Band-Split Neural Codec for High-Quality Universal Audio Reconstruction EACL 2026
POWSM: A Phonetic Open Whisper-Style Speech Foundation Model ACL 2026
Speech-Hands: A Self-Reflection Voice Agentic Approach to Speech Recognition and Audio Reasoning with Omni Perception ACL 2026
Leveraging Allophony in Self-Supervised Speech Models for Atypical Pronunciation Assessment NAACL 2025
SpeechIQ: Speech-Agentic Intelligence Quotient Across Cognitive Levels in Voice Understanding by Large Language Models ACL 2025
Summarizing Speech: A Comprehensive Survey EMNLP 2025
Context-aware Dynamic Pruning for Speech Foundation Models ICLR 2025
Talking Turns: Benchmarking Audio Foundation Models on Turn-Taking Dynamics ICLR 2025
Dynamic-SUPERB Phase-2: A Collaboratively Expanding Benchmark for Measuring the Capabilities of Spoken Language Models with 180 Tasks ICLR 2025
OWLS: Scaling Laws for Multilingual Speech Recognition and Translation Models ICML 2025
Enhancing Audiovisual Speech Recognition Through Bifocal Preference Optimization AAAI 2025
ESPnet-SDS: Unified Toolkit and Demo for Spoken Dialogue Systems NAACL 2025
VERSA: A Versatile Evaluation Toolkit for Speech, Audio, and Music NAACL 2025
ESPnet-SpeechLM: An Open Speech Language Model Toolkit NAACL 2025
VoiceTextBlender: Augmenting Large Language Models with Speech Capabilities via Single-Stage Joint Speech-Text Supervised Fine-Tuning NAACL 2025
Wav2Gloss: Generating Interlinear Glossed Text from Speech ACL 2024
Self-Supervised Speech Representations are More Phonetic than Semantic INTERSPEECH 2024
Can you Remove the Downstream Model for Speaker Recognition with Self-Supervised Speech Features? INTERSPEECH 2024
MULTI-CONVFORMER: Extending Conformer with Multiple Convolution Kernels INTERSPEECH 2024
OWSM v3.1: Better and Faster Open Whisper-Style Speech Models based on E-Branchformer INTERSPEECH 2024
Neural Blind Source Separation and Diarization for Distant Speech Recognition INTERSPEECH 2024
Finding Task-specific Subnetworks in Multi-task Spoken Language Understanding Model INTERSPEECH 2024
CMU’s IWSLT 2024 Offline Speech Translation System: A Cascaded Approach For Long-Form Robustness ACL 2024
Convolution-Augmented Parameter-Efficient Fine-Tuning for Speech Recognition INTERSPEECH 2024
AudioGPT: Understanding and Generating Speech, Music, Sound, and Talking Head AAAI 2024
ML-SUPERB 2.0: Benchmarking Multilingual Speech Models Across Modeling Constraints, Languages, and Datasets INTERSPEECH 2024
URGENT Challenge: Universality, Robustness, and Generalizability For Speech Enhancement INTERSPEECH 2024
Singing Voice Data Scaling-up: An Introduction to ACE-Opencpop and ACE-KiSing INTERSPEECH 2024
The Interspeech 2024 Challenge on Speech Processing Using Discrete Units INTERSPEECH 2024
Towards Robust Speech Representation Learning for Thousands of Languages EMNLP 2024
FastAdaSP: Multitask-Adapted Efficient Inference for Large Speech Language Model EMNLP 2024
Findings of the IWSLT 2024 Evaluation Campaign ACL 2024
Beyond Performance Plateaus: A Comprehensive Study on Scalability in Speech Enhancement INTERSPEECH 2024
Cross-Talk Reduction IJCAI 2024
SpeechBERTScore: Reference-Aware Automatic Evaluation of Speech Generation Leveraging NLP Evaluation Metrics INTERSPEECH 2024
UniverSLU: Universal Spoken Language Understanding for Diverse Tasks with Natural Language Instructions NAACL 2024
EARS: An Anechoic Fullband Speech Dataset Benchmarked for Speech Enhancement and Dereverberation INTERSPEECH 2024
MMM: Multi-Layer Multi-Residual Multi-Stream Discrete Speech Representation from Self-supervised Learning Model INTERSPEECH 2024
CMU’s IWSLT 2024 Simultaneous Speech Translation System ACL 2024
Decoder-only Architecture for Streaming End-to-end Speech Recognition INTERSPEECH 2024
ESPnet-SPK: full pipeline speaker embedding toolkit with reproducible recipes, self-supervised front-ends, and off-the-shelf models INTERSPEECH 2024
DiscreteSLU: A Large Language Model with Self-Supervised Discrete Speech Units for Spoken Language Understanding INTERSPEECH 2024
EFFUSE: Efficient Self-Supervised Feature Fusion for E2E ASR in Low Resource and Multilingual Scenarios INTERSPEECH 2024
On the Effects of Heterogeneous Data Sources on Speech-to-Text Foundation Models INTERSPEECH 2024
Contextualized End-to-end Automatic Speech Recognition with Intermediate Biasing Loss INTERSPEECH 2024
To what extent can ASV systems naturally defend against spoofing attacks? INTERSPEECH 2024
Rapid Language Adaptation for Multilingual E2E Speech Recognition Using Encoder Prompting INTERSPEECH 2024
Self-training ASR Guided by Unsupervised ASR Teacher INTERSPEECH 2024
On the Evaluation of Speech Foundation Models for Spoken Language Understanding ACL 2024
OWSM-CTC: An Open Encoder-Only Speech Foundation Model for Speech Recognition, Translation, and Language Identification ACL 2024
SigMoreFun Submission to the SIGMORPHON Shared Task on Interlinear Glossing ACL 2023
DPHuBERT: Joint Distillation and Pruning of Self-Supervised Speech Models INTERSPEECH 2023
Prompting the Hidden Talent of Web-Scale Speech Models for Zero-Shot Task Generalization INTERSPEECH 2023
Tensor decomposition for minimization of E2E SLU model toward on-device processing INTERSPEECH 2023
Integrating Pretrained ASR and LM to Perform Sequence Generation for Spoken Language Understanding INTERSPEECH 2023
ML-SUPERB: Multilingual Speech Universal PERformance Benchmark INTERSPEECH 2023
Integration of Frame- and Label-synchronous Beam Search for Streaming Encoder-decoder Speech Recognition INTERSPEECH 2023
Exploration of Efficient End-to-End ASR using Discretized Input from Self-Supervised Learning INTERSPEECH 2023
BASS: Block-wise Adaptation for Speech Summarization INTERSPEECH 2023
A New Benchmark of Aphasia Speech Recognition and Detection Based on E-Branchformer and Multi-task Learning INTERSPEECH 2023
A Comparative Study on E-Branchformer vs Conformer in Speech Recognition, Translation, and Understanding Tasks INTERSPEECH 2023
Exploration on HuBERT with Multiple Resolution INTERSPEECH 2023
CTC Alignments Improve Autoregressive Translation EACL 2023
Bayes Risk CTC: Controllable CTC Alignment in Sequence-to-Sequence Tasks ICLR 2023
Efficient Sequence Transduction by Jointly Predicting Tokens and Durations ICML 2023
Learning to Speak from Text: Zero-Shot Multilingual Text-to-Speech with Unsupervised Text Pretraining IJCAI 2023
Deep Speech Synthesis from MRI-Based Articulatory Representations INTERSPEECH 2023
Bayes Risk Transducer: Transducer with Controllable Alignment Prediction INTERSPEECH 2023
Time-synchronous one-pass Beam Search for Parallel Online and Offline Transducers with Dynamic Block Training INTERSPEECH 2023
Reducing Barriers to Self-Supervised Learning: HuBERT Pre-training with Academic Compute INTERSPEECH 2023
Incremental Blockwise Beam Search for Simultaneous Speech Translation with Controllable Quality-Latency Tradeoff INTERSPEECH 2023
UNSSOR: Unsupervised Neural Speech Separation by Leveraging Over-determined Training Mixtures NIPS 2023
A Vector Quantized Approach for Text to Speech Synthesis on Real-World Spontaneous Speech AAAI 2023
SLUE Phase-2: A Benchmark Suite of Diverse Spoken Language Understanding Tasks ACL 2023
4D ASR: Joint modeling of CTC, Attention, Transducer, and Mask-Predict decoders INTERSPEECH 2023
UnitY: Two-pass Direct Speech-to-speech Translation with Discrete Units ACL 2023
ESPnet-ST-v2: Multipurpose Spoken Language Translation Toolkit ACL 2023
Findings of the IWSLT 2023 Evaluation Campaign ACL 2023
CMU’s IWSLT 2023 Simultaneous Speech Translation System ACL 2023
Improving Speech Enhancement through Fine-Grained Speech Characteristics INTERSPEECH 2022
TriniTTS: Pitch-controllable End-to-end TTS without External Aligner INTERSPEECH 2022
SingAug: Data Augmentation for Singing Voice Synthesis with Cycle-consistent Training Strategy INTERSPEECH 2022
Residual Language Model for End-to-end Speech Recognition INTERSPEECH 2022
Deep Speech Synthesis from Articulatory Representations INTERSPEECH 2022
Attention Weight Smoothing Using Prior Distributions for Transformer-Based End-to-End ASR INTERSPEECH 2022
SUPERB-SG: Enhanced Speech processing Universal PERformance Benchmark for Semantic and Generative Capabilities ACL 2022
Zero-shot Learning for Grapheme to Phoneme Conversion with Language Ensemble ACL 2022
Findings of the IWSLT 2022 Evaluation Campaign ACL 2022
CMU’s IWSLT 2022 Dialect Speech Translation System ACL 2022
Audio-Visual Wake Word Spotting in MISP2021 Challenge: Dataset Release and Deep Analysis INTERSPEECH 2022
VQ-T: RNN Transducers using Vector-Quantized Prediction Network States INTERSPEECH 2022
Blockwise Streaming Transformer for Spoken Language Understanding and Simultaneous Speech Translation INTERSPEECH 2022
Audio-Visual Speech Recognition in MISP2021 Challenge: Dataset Release and Deep Analysis INTERSPEECH 2022
Minimum latency training of sequence transducers for streaming end-to-end speech recognition INTERSPEECH 2022
Updating Only Encoders Prevents Catastrophic Forgetting of End-to-End ASR Models INTERSPEECH 2022
Online Continual Learning of End-to-End Speech Recognition Models INTERSPEECH 2022
End-to-End Integration of Speech Recognition, Speech Enhancement, and Self-Supervised Learning Representation INTERSPEECH 2022
Self-supervised Representation Learning for Speech Processing NAACL 2022
Branchformer: Parallel MLP-Attention Architectures to Capture Local and Global Context for Speech Recognition and Understanding ICML 2022
BERT Meets CTC: New Formulation of End-to-End Speech Recognition with Pre-trained Masked Language Model EMNLP 2022
Token-level Sequence Labeling for Spoken Language Understanding using Compositional End-to-End Models EMNLP 2022
ESPnet-SE++: Speech Enhancement for Robust Speech Recognition, Translation, and Understanding INTERSPEECH 2022
Better Intermediates Improve CTC Inference INTERSPEECH 2022
ASR2K: Speech Recognition for Around 2000 Languages without Audio INTERSPEECH 2022
Streaming Automatic Speech Recognition with Re-blocking Processing Based on Integrated Voice Activity Detection INTERSPEECH 2022
Memory-Efficient Training of RNN-Transducer with Sampled Softmax INTERSPEECH 2022
Muskits: an End-to-end Music Processing Toolkit for Singing Voice Synthesis INTERSPEECH 2022
When Is TTS Augmentation Through a Pivot Language Useful? INTERSPEECH 2022
Combining Spectral and Self-Supervised Features for Low Resource Speech Recognition and Translation INTERSPEECH 2022
Two-Pass Low Latency End-to-End Spoken Language Understanding INTERSPEECH 2022
Continuous Speech Separation Using Speaker Inventory for Long Recording INTERSPEECH 2021
ESPnet-ST IWSLT 2021 Offline Speech Translation System ACL 2021
Self-Guided Curriculum Learning for Neural Machine Translation ACL 2021
Leveraging End-to-End ASR for Endangered Language Documentation: An Empirical Study on Yolóxochitl Mixtec EACL 2021
ESPnet-ST IWSLT 2021 Offline Speech Translation System IJCNLP 2021
Self-Guided Curriculum Learning for Neural Machine Translation IJCNLP 2021
Data Augmentation Methods for End-to-End Speech Recognition on Distant-Talk Scenarios INTERSPEECH 2021
Acoustic Event Detection with Classifier Chains INTERSPEECH 2021
SUPERB: Speech Processing Universal PERformance Benchmark INTERSPEECH 2021
Rethinking End-to-End Evaluation of Decomposable Tasks: A Case Study on Spoken Language Understanding INTERSPEECH 2021
SPGISpeech: 5,000 Hours of Transcribed Financial Audio for Fully Formatted End-to-End Speech Recognition INTERSPEECH 2021
Auxiliary Loss Function for Target Speech Extraction and Recognition with Weak Supervision Based on Speaker Characteristics INTERSPEECH 2021
Speech Representation Learning Combining Conformer CPC with Deep Cluster for the ZeroSpeech Challenge 2021 INTERSPEECH 2021
Multi-Mode Transformer Transducer with Stochastic Future Context INTERSPEECH 2021
Differentiable Allophone Graphs for Language-Universal Speech Recognition INTERSPEECH 2021
Semi-Supervised Training with Pseudo-Labeling for End-To-End Neural Diarization INTERSPEECH 2021
Online Streaming End-to-End Neural Diarization Handling Overlapping Speech and Flexible Numbers of Speakers INTERSPEECH 2021
Leveraging Pre-Trained Language Model for Speech Sentiment Analysis INTERSPEECH 2021
Speaker Verification-Based Evaluation of Single-Channel Speech Separation INTERSPEECH 2021
Target-Speaker Voice Activity Detection with Improved i-Vector Estimation for Unknown Number of Speaker INTERSPEECH 2021
GigaSpeech: An Evolving, Multi-Domain ASR Corpus with 10,000 Hours of Transcribed Audio INTERSPEECH 2021
Multi-Speaker ASR Combining Non-Autoregressive Conformer CTC and Conditional Speaker Chain INTERSPEECH 2021
Toward Streaming ASR with Non-Autoregressive Insertion-Based Model INTERSPEECH 2021
Layer Pruning on Demand with Intermediate CTC INTERSPEECH 2021
Streaming End-to-End ASR Based on Blockwise Non-Autoregressive Models INTERSPEECH 2021
End-to-end ASR to jointly predict transcriptions and linguistic annotations NAACL 2021
Source and Target Bidirectional Knowledge Distillation for End-to-end Speech Translation NAACL 2021
Searchable Hidden Intermediates for End-to-End Models of Decomposable Sequence Tasks NAACL 2021
Highland Puebla Nahuatl Speech Translation Corpus for Endangered Language Documentation NAACL 2021
Learning Speaker Embedding from Text-to-Speech INTERSPEECH 2020
End-to-End ASR with Adaptive Span Self-Attention INTERSPEECH 2020
Mask CTC: Non-Autoregressive End-to-End ASR with CTC and Mask Predict INTERSPEECH 2020
Speaker-Conditional Chain Model for Speech Separation and Extraction INTERSPEECH 2020
End-to-End Far-Field Speech Recognition with Unified Dereverberation and Beamforming INTERSPEECH 2020
End-to-End Speaker Diarization for an Unknown Number of Speakers with Encoder-Decoder Based Attractors INTERSPEECH 2020
Insertion-Based Modeling for End-to-End Automatic Speech Recognition INTERSPEECH 2020
Sequence to Multi-Sequence Learning via Conditional Chain Mapping for Mixture Signals NIPS 2020
ESPnet-ST: All-in-One Speech Translation Toolkit ACL 2020
Auxiliary Interference Speaker Loss for Target-Speaker Speech Recognition INTERSPEECH 2019
Pre-Trained Text Embeddings for Enhanced Text-to-Speech Synthesis INTERSPEECH 2019
Pretraining by Backtranslation for End-to-End ASR in Low-Resource Settings INTERSPEECH 2019
Study of the Performance of Automatic Speech Recognition Systems in Speakers with Parkinson’s Disease INTERSPEECH 2019
End-to-End Neural Speaker Diarization with Permutation-Free Objectives INTERSPEECH 2019
Vectorized Beam Search for CTC-Attention-Based Speech Recognition INTERSPEECH 2019
Semi-Supervised Sequence-to-Sequence ASR Using Unpaired Speech and Text INTERSPEECH 2019
End-to-End Multilingual Multi-Speaker Speech Recognition INTERSPEECH 2019
Massively Multilingual Adversarial Speech Recognition NAACL 2019
Analysis of Multilingual Sequence-to-Sequence Speech Recognition Systems INTERSPEECH 2019
Speaker Recognition Benchmark Using the CHiME-5 Corpus INTERSPEECH 2019
Improving Transformer-Based End-to-End Speech Recognition with Connectionist Temporal Classification and Language Model Integration INTERSPEECH 2019
End-to-End SpeakerBeam for Single Channel Target Speech Recognition INTERSPEECH 2019
The JHU/KyotoU Speech Translation System for IWSLT 2018 EMNLP 2018
Student-Teacher Learning for BLSTM Mask-based Speech Enhancement INTERSPEECH 2018
Diarization is Hard: Some Experiences and Lessons Learned for the JHU Team in the Inaugural DIHARD Challenge INTERSPEECH 2018
Auxiliary Feature Based Adaptation of End-to-end ASR Systems INTERSPEECH 2018
Multi-Modal Data Augmentation for End-to-end ASR INTERSPEECH 2018
ESPnet: End-to-End Speech Processing Toolkit INTERSPEECH 2018
Effectiveness of Single-Channel BLSTM Enhancement for Language Identification INTERSPEECH 2018
Building State-of-the-art Distant Speech Recognition Using the CHiME-4 Challenge with a Setup of Speech Enhancement Baseline INTERSPEECH 2018
The Fifth 'CHiME' Speech Separation and Recognition Challenge: Dataset, Task and Baselines INTERSPEECH 2018
Multi-Head Decoder for End-to-End Speech Recognition INTERSPEECH 2018
Semi-Supervised End-to-End Speech Recognition INTERSPEECH 2018
A Purely End-to-End System for Multi-speaker Speech Recognition ACL 2018
Joint CTC/attention decoding for end-to-end speech recognition ACL 2017
Multichannel End-to-end Speech Recognition ICML 2017
Advances in Joint CTC-Attention Based End-to-End Speech Recognition with a Deep CNN Encoder and RNN-LM INTERSPEECH 2017
Coupled Initialization of Multi-Channel Non-Negative Matrix Factorization Based on Spatial and Spectral Information INTERSPEECH 2017
Semi-Supervised Learning of a Pronunciation Dictionary from Disjoint Phonemic Transcripts and Text INTERSPEECH 2017
Context-Sensitive and Role-Dependent Spoken Language Understanding Using Bidirectional and Attention LSTMs INTERSPEECH 2016
Data Selection by Sequence Summarizing Neural Network in Mismatch Condition Training INTERSPEECH 2016
Improved MVDR Beamforming Using Single-Channel Mask Prediction Networks INTERSPEECH 2016
Single-Channel Multi-Speaker Separation Using Deep Clustering INTERSPEECH 2016
Statistical Dialogue Management using Intention Dependency Graph IJCNLP 2013