Kai Yu

120 papers · 2006–2026 · 13 top CS/AI conferences

Achievements

🧭 Keyword Pioneer 🌉 Interdisciplinary Bridge 🌈 Renaissance Researcher (9) 🗺️ Taxonomy Completionist (38) 🐣 Hot Topic Early Bird

🐝 Cross-Pollinator (7) 🏠 Conference Loyalist (36) 🌟 Keyword Trendsetter Combo (5) 🔬 Deep Specialist (20) 🌱 Topic Pioneer 🏆 Keyword Champion 🧬 Topic Evolution 👥 Mega-Team (23) 🤝 Dynamic Duo (48) 📈 Trend Setter 🚀 Conference Pioneer 🔥 Unstoppable (14) ❓ The Questioner (4) ⚡ Prolific Year (17) 💎 Century Club (116) 🗃️ Keyword Collector (116)

Conferences

INTERSPEECH (36) ACL (18) EMNLP (17) NIPS (12) AAAI (9) COLING (7) NAACL (5) ICCV (4) ICML (4) IJCNLP (3) CVPR (2) EACL (2) MICCAI (1)

Top co-authors

Lu Chen (48) Su Zhu (19) Ruisheng Cao (18) Xie Chen (13) Hongshen Xu (11) Yanmin Qian (11) Zhi Chen (10) Shuai Wang (10) Mengyue Wu (9) Yanbin Zhao (7)

Keywords

large language model (15) semantic parsing (9) speech synthesis (7) domain adaptation (6) data augmentation (6) speaker verification (5) automatic speech recognition (5) speaker embedding (5) knowledge distillation (5) graph neural network (5) long short-term memory (4) model compression (4) vector quantization (4) connectionist temporal classification (4) text-to-speech synthesis (4) speech recognition (4) dialogue state tracking (4) transfer learning (4) unsupervised learning (4) semi-supervised learning (4)

Papers

MergeDNA: Context-Aware Genome Modeling with Dynamic Tokenization Through Token Merging AAAI 2026
BSCodec: A Band-Split Neural Codec for High-Quality Universal Audio Reconstruction EACL 2026
AHAMask: Reliable Task Specification for Large Audio Language Models Without Instructions AAAI 2026
Phased One-Step Adversarial Equilibrium for Video Diffusion Models AAAI 2026
Alignment for Efficient Tool Calling of Large Language Models EMNLP 2025
When Long Helps Short: How Context Length in Supervised Fine-tuning Affects Behavior of Large Language Models EMNLP 2025
MobA: Multifaceted Memory-Enhanced Adaptive Planning for Efficient Mobile Task Automation NAACL 2025
Converging to a Lingua Franca: Evolution of Linguistic Regions and Semantics Alignment in Multilingual Large Language Models COLING 2025
GigaSpeech 2: An Evolving, Large-Scale and Multi-domain ASR Corpus for Low-Resource Languages with Automated Crawling, Transcription and Refinement ACL 2025
NeuSym-RAG: Hybrid Neural Symbolic Retrieval with Multiview Structuring for PDF Question Answering ACL 2025
F5-TTS: A Fairytaler that Fakes Fluent and Faithful Speech with Flow Matching ACL 2025
SLAM-Omni: Timbre-Controllable Voice Interaction System with Single-Stage Training ACL 2025
From Generalist to Specialist: A Survey of Large Language Models for Chemistry COLING 2025
VQTalker: Towards Multilingual Talking Avatars Through Facial Motion Tokenization AAAI 2025
Reducing Tool Hallucination via Reliability Alignment ICML 2025
ChatCite: LLM Agent with Human Workflow Guidance for Comparative Literature Summary COLING 2025
Heads up! Large Language Models Can Perform Tasks Without Your Instruction via Selective Attention Head Masking ICML 2025
Bitrate-Controlled Diffusion for Disentangling Motion and Content in Video ICCV 2025
URO-Bench: Towards Comprehensive Evaluation for End-to-End Spoken Dialogue Models EMNLP 2025
Enhancing Speech-to-Speech Dialogue Modeling with End-to-End Retrieval-Augmented Generation EMNLP 2025
AlignSum: Data Pyramid Hierarchical Fine-tuning for Aligning with Human Summarization Preference EMNLP 2024
Multilingual Brain Surgeon: Large Language Models Can Be Compressed Leaving No Language behind COLING 2024
SPEADO: Segmentation and Punctuation for Ancient Chinese Texts via Example Augmentation and Decoding Optimization COLING 2024
DiffusionGAN3D: Boosting Text-guided 3D Generation and Domain Adaptation by Combining 3D GANs and Diffusion Priors CVPR 2024
UrFound: Towards Universal Retinal Foundation Models via Knowledge-Guided Masked Modeling MICCAI 2024
DiveSound: LLM-Assisted Automatic Taxonomy Construction for Diverse Audio Generation INTERSPEECH 2024
Spider2-V: How Far Are Multimodal Agents From Automating Data Science and Engineering Workflows? NIPS 2024
On the Effectiveness of Acoustic BPE in Decoder-Only TTS INTERSPEECH 2024
Text-aware Speech Separation for Multi-talker Keyword Spotting INTERSPEECH 2024
FakeSound: Deepfake General Audio Detection INTERSPEECH 2024
UniCATS: A Unified Context-Aware Text-to-Speech Framework with Contextual VQ-Diffusion and Vocoding AAAI 2024
SciEval: A Multi-Level Large Language Model Evaluation Benchmark for Scientific Research AAAI 2024
Evolving Subnetwork Training for Large Language Models ICML 2024
CoE-SQL: In-Context Learning for Multi-Turn Text-to-SQL with Chain-of-Editions NAACL 2024
IBSEN: Director-Actor Agent Collaboration for Controllable and Interactive Drama Script Generation ACL 2024
Sparsity-Accelerated Training for Large Language Models ACL 2024
Is LLM a Reliable Reviewer? A Comprehensive Evaluation of LLM on Automatic Paper Reviewing Tasks COLING 2024
UnSE: Unsupervised Speech Enhancement Using Optimal Transport INTERSPEECH 2023
PointGPT: Auto-regressively Generative Pre-training from Point Clouds NIPS 2023
Large Language Models Are Semi-Parametric Reinforcement Learning Agents NIPS 2023
TeCS: A Dataset and Benchmark for Tense Consistency of Machine Translation ACL 2023
SPM: A Split-Parsing Method for Joint Multi-Intent Detection and Slot Filling ACL 2023
Exploring Schema Generalizability of Text-to-SQL ACL 2023
CSS: A Large-scale Cross-schema Chinese Text-to-SQL Medical Dataset ACL 2023
ACT-SQL: In-Context Learning for Text-to-SQL with Automatically-Generated Chain-of-Thought EMNLP 2023
Towards Instance-adaptive Inference for Federated Learning ICCV 2023
Diverse Data Augmentation with Diffusions for Effective Test-time Prompt Tuning ICCV 2023
DSE-TTS: Dual Speaker Embedding for Cross-Lingual Text-to-Speech INTERSPEECH 2023
Improving Code-Switching and Name Entity Recognition in ASR with Speech Editing based Data Augmentation INTERSPEECH 2023
How ChatGPT is Robust for Spoken Language Understanding? INTERSPEECH 2023
ReCLR: Reference-Enhanced Contrastive Learning of Audio Representation for Depression Detection INTERSPEECH 2023
Enhance Temporal Relations in Audio Captioning with Sound Event Detection INTERSPEECH 2023
AdapterShare: Task Correlation Modeling with Adapter Differentiation EMNLP 2022
TIE: Topological Information Enhanced Structural Reading Comprehension on Web Pages NAACL 2022
The AISP-SJTU Translation System for WMT 2022 EMNLP 2022
VQTTS: High-Fidelity Text-to-Speech Synthesis with Self-Supervised VQ Acoustic Feature INTERSPEECH 2022
MSDWild: Multi-modal Speaker Diarization Dataset in the Wild INTERSPEECH 2022
The AISP-SJTU Simultaneous Translation System for IWSLT 2022 ACL 2022
D4: a Chinese Dialogue Dataset for Depression-Diagnosis-Oriented Chat EMNLP 2022
Efficient Speech Enhancement with Neural Homomorphic Synthesis INTERSPEECH 2022
META-GUI: Towards Multi-modal Conversational Agents on Mobile GUI EMNLP 2022
WebSRC: A Dataset for Web-Based Structural Reading Comprehension EMNLP 2021
LET: Linguistic Knowledge Enhanced Graph Transformer for Chinese Short Text Matching AAAI 2021
Glyph Enhanced Chinese Character Pre-Training for Lexical Sememe Prediction EMNLP 2021
Decoupled Dialogue Modeling and Semantic Parsing for Multi-Turn Text-to-SQL IJCNLP 2021
Rich Prosody Diversity Modelling with Phone-Level Mixture Density Network INTERSPEECH 2021
Class-Based Neural Network Language Model for Second-Pass Rescoring in ASR INTERSPEECH 2021
A Lightweight Framework for Online Voice Activity Detection in the Wild INTERSPEECH 2021
Decoupled Dialogue Modeling and Semantic Parsing for Multi-Turn Text-to-SQL ACL 2021
LGESQL: Line Graph Enhanced Text-to-SQL Model with Mixed Local and Non-Local Relations ACL 2021
LGESQL: Line Graph Enhanced Text-to-SQL Model with Mixed Local and Non-Local Relations IJCNLP 2021
ShadowGNN: Graph Projection Neural Network for Text-to-SQL Parser NAACL 2021
Dual-Adversarial Domain Adaptation for Generalized Replay Attack Detection INTERSPEECH 2020
Neural Homomorphic Vocoder INTERSPEECH 2020
Unsupervised Dual Paraphrasing for Two-stage Semantic Parsing ACL 2020
Efficient Context and Schema Fusion Networks for Multi-Domain Dialogue State Tracking EMNLP 2020
Schema-Guided Multi-Domain Dialogue State Tracking with Graph Attention Neural Networks AAAI 2020
Neural Graph Matching Networks for Chinese Short Text Matching ACL 2020
Semi-Supervised Text Simplification with Back-Translation and Asymmetric Denoising Autoencoders AAAI 2020
Line Graph Enhanced AMR-to-Text Generation with Mix-Order Graph Attention Networks ACL 2020
Voice Activity Detection in the Wild via Weakly Supervised Sound Event Detection INTERSPEECH 2020
Jointly Encoding Word Confusion Network and Dialogue Context with BERT for Spoken Language Understanding INTERSPEECH 2020
On the Usage of Phonetic Information for Text-Independent Speaker Embedding Extraction INTERSPEECH 2019
Semantic Parsing with Dual Learning ACL 2019
Data Augmentation with Atomic Templates for Spoken Language Understanding IJCNLP 2019
The SJTU Robust Anti-Spoofing System for the ASVspoof 2019 Challenge INTERSPEECH 2019
Data Augmentation Using Variational Autoencoder for Embedding Based Speaker Verification INTERSPEECH 2019
Joint Decoding of CTC Based Systems for Speech Recognition INTERSPEECH 2019
Cross-Domain Replay Spoofing Attack Detection Using Domain Adversarial Training INTERSPEECH 2019
Data Augmentation with Atomic Templates for Spoken Language Understanding EMNLP 2019
Binarized LSTM Language Model NAACL 2018
Knowledge Distillation for Sequence Model INTERSPEECH 2018
Towards Universal Dialogue State Tracking EMNLP 2018
Angular Softmax for Short-Duration Text-independent Speaker Verification INTERSPEECH 2018
Structured Dialogue Policy with Graph Neural Networks COLING 2018
High-quality Voice Conversion Using Spectrogram-Based WaveNet Vocoder INTERSPEECH 2018
Structured Word Embedding for Low Memory Neural Network Language Model INTERSPEECH 2018
On-line Dialogue Policy Learning with Companion Teaching EACL 2017
What Does the Speaker Embedding Encode? INTERSPEECH 2017
Comparison of Modeling Target in LSTM-RNN Duration Model INTERSPEECH 2017
Discrete Duration Model for Speech Synthesis INTERSPEECH 2017
Binary Deep Neural Networks for Speech Recognition INTERSPEECH 2017
Agent-Aware Dropout DQN for Safe and Efficient On-line Dialogue Policy Learning EMNLP 2017
Affordable On-line Dialogue Policy Learning EMNLP 2017
Hybrid Dialogue State Tracking for Real World Human-to-Human Dialogues INTERSPEECH 2016
Unrestricted Vocabulary Keyword Spotting Using LSTM-CTC INTERSPEECH 2016
Phone Synchronous Decoding with CTC Lattice INTERSPEECH 2016
Text Flow: A Unified Text Detection System in Natural Scene Images ICCV 2015
Deep Multiple Instance Learning for Image Classification and Auto-Annotation CVPR 2015
Communication Efficient Distributed Machine Learning with the Parameter Server NIPS 2014
Smooth Sparse Coding via Marginal Regression for Learning Sparse Representations ICML 2013
Deep Learning of Invariant Features via Simulated Fixations in Video NIPS 2012
Deep Coding Network NIPS 2010
Phrase-Based Statistical Language Generation Using Graphical Models and Active Learning ACL 2010
Nonlinear Learning using Local Coordinate Coding NIPS 2009
Stochastic Relational Models for Large-scale Dyadic Data using MCMC NIPS 2008
Deep Learning with Kernel Regularization for Visual Recognition NIPS 2008
Predictive Matrix-Variate t Models NIPS 2007
Gaussian Process Models for Link Analysis and Transfer Learning NIPS 2007
Stochastic Relational Models for Discriminative Link Prediction NIPS 2006