Florian Metze

55 papers · 2007–2026 · 11 top CS/AI conferences

Achievements

🗺️ Taxonomy Completionist (12) 🧭 Keyword Pioneer 🌈 Renaissance Researcher (5) 🌉 Interdisciplinary Bridge 🐣 Hot Topic Early Bird

🐝 Cross-Pollinator (8) 🏠 Conference Loyalist (22) 🌟 Keyword Trendsetter Combo (5) 🤝 Dynamic Duo (12) 🧬 Topic Evolution 🏆 Keyword Champion 🔬 Deep Specialist (12) 🔥 Unstoppable (8) 🚀 Conference Pioneer ⚡ Prolific Year (9) 📈 Trend Setter 🗃️ Keyword Collector (209) 💎 Century Club (54)

Conferences

INTERSPEECH (22) ACL (7) EMNLP (7) NAACL (5) EACL (4) NIPS (3) CVPR (2) IJCNLP (2) AAAI (1) ICCV (1) ICLR (1)

Top co-authors

Siddharth Dalmia (12) Xinjian Li (8) Po-Yao Huang (8) Shinji Watanabe (7) Juncheng Li (7) Alan W. Black (5) Christoph Feichtenhofer (5) Alan W Black (5) Brian Yan (4) Vikas Raunak (4)

Keywords

automatic speech recognition (10) multimodal learning (7) zero-shot learning (5) attention mechanism (5) connectionist temporal classification (4) end-to-end speech recognition (4) low-resource language (4) speech recognition (4) acoustic model (4) end-to-end model (4) word error rate (3) self-supervised learning (3) contrastive learning (3) transfer learning (3) word embedding (3) video understanding (3) deep neural network (3) conversational context (3) deep learning (2) spoken language understanding (2)

Papers

Aligning Paralinguistic Understanding and Generation in Speech LLMs via Multi-Task Reinforcement Learning · EACL 2026
CTC Alignments Improve Autoregressive Translation · EACL 2023
AudioTagging Done Right: 2nd comparison of deep learning methods for environmental sound classification · INTERSPEECH 2022
Masked Autoencoders that Listen · NIPS 2022
Normalized Contrastive Learning for Text-Video Retrieval · EMNLP 2022
Token-level Sequence Labeling for Spoken Language Understanding using Compositional End-to-End Models · EMNLP 2022
On Advances in Text Generation from Images Beyond Captioning: A Case Study in Self-Rationalization · EMNLP 2022
Zero-shot Learning for Grapheme to Phoneme Conversion with Language Ensemble · ACL 2022
ASR2K: Speech Recognition for Around 2000 Languages without Audio · INTERSPEECH 2022
Self-Supervised Object Detection From Audio-Visual Correspondence · CVPR 2022
Hierarchical Phone Recognition with Compositional Phonetics · INTERSPEECH 2021
Multimodal Speech Summarization Through Semantic Concept Learning · INTERSPEECH 2021
Keeping Your Eye on the Ball: Trajectory Attention in Video Transformers · NIPS 2021
VLM: Task-agnostic Video-Language Model Pre-training for Video Understanding · ACL 2021
How2Sign: A Large-Scale Multimodal Dataset for Continuous American Sign Language · CVPR 2021
Multilingual Multimodal Pre-training for Zero-Shot Cross-Lingual Transfer of Vision-Language Models · NAACL 2021
NoiseQA: Challenge Set Evaluation for User-Centric Question Answering · EACL 2021
VideoCLIP: Contrastive Pre-training for Zero-shot Video-Text Understanding · EMNLP 2021
Space-Time Crop & Attend: Improving Cross-Modal Video Representation Learning · ICCV 2021
Support-set bottlenecks for video-text representation learning · ICLR 2021
Searchable Hidden Intermediates for End-to-End Models of Decomposable Sequence Tasks · NAACL 2021
VLM: Task-agnostic Video-Language Model Pre-training for Video Understanding · IJCNLP 2021
Differentiable Allophone Graphs for Language-Universal Speech Recognition · INTERSPEECH 2021
Rethinking End-to-End Evaluation of Decomposable Tasks: A Case Study on Spoken Language Understanding · INTERSPEECH 2021
On Long-Tailed Phenomena in Neural Machine Translation · EMNLP 2020
Multimodal Speech Recognition with Unstructured Audio Masking · EMNLP 2020
Towards Context-Aware End-to-End Code-Switching Speech Recognition · INTERSPEECH 2020
Fine-Grained Grounding for Multimodal Speech Recognition · EMNLP 2020
On Dimensional Linguistic Properties of the Word Embedding Space · ACL 2020
Towards Zero-Shot Learning for Automatic Phonemic Transcription · AAAI 2020
Contextual RNN-T for Open Domain ASR · INTERSPEECH 2020
Multimodal Abstractive Summarization for How2 Videos · ACL 2019
Gated Embeddings in End-to-End Speech Recognition for Conversational-Context Fusion · ACL 2019
Adversarial Music: Real world Audio Adversary against Wake-word Detection System · NIPS 2019
Effective Dimensionality Reduction for Word Embeddings · ACL 2019
Multilingual Speech Recognition with Corpus Relatedness Sampling · INTERSPEECH 2019
Survey Talk: Multimodal Processing of Speech and Language · INTERSPEECH 2019
SANTLR: Speech Annotation Toolkit for Low Resource Languages · INTERSPEECH 2019
Cross-Attention End-to-End ASR for Two-Party Conversations · INTERSPEECH 2019
Acoustic-to-Word Models with Conversational Context Information · NAACL 2019
The ACLEW DiViMe: An Easy-to-use Diarization Tool · INTERSPEECH 2018
Comparing the Max and Noisy-Or Pooling Functions in Multiple Instance Learning for Weakly Supervised Sequence Learning Tasks · INTERSPEECH 2018
Subword and Crossword Units for CTC Acoustic Models · INTERSPEECH 2018
Multiple Instance Deep Learning for Weakly Supervised Small-Footprint Audio Event Detection · INTERSPEECH 2018
Comparison of Decoding Strategies for CTC Acoustic Models · INTERSPEECH 2017
A Transfer Learning Based Feature Extractor for Polyphonic Sound Event Detection Using Connectionist Temporal Classification · INTERSPEECH 2017
Open-Domain Audio-Visual Speech Recognition: A Deep Learning Approach · INTERSPEECH 2016
Manipulating Word Lattices to Incorporate Human Corrections · INTERSPEECH 2016
Experiences with Shared Resources for Research and Education in Speech and Language Processing · INTERSPEECH 2016
Virtual Machines and Containers as a Platform for Experimentation · INTERSPEECH 2016
Augmenting Translation Models with Simulated Acoustic Confusions for Improved Spoken Language Translation · EACL 2014
Semantics for Large-Scale Multimedia: New Challenges for NLP · ACL 2014
Prosody-Based Unsupervised Speech Summarization with Two-Layer Mutually Reinforced Random Walk · IJCNLP 2013
Intra-Speaker Topic Modeling for Improved Multi-Party Meeting Summarization with Integrated Random Walk · NAACL 2012
On using Articulatory Features for Discriminative Speaker Adaptation · NAACL 2007