Po-Yao Huang

23 papers · 2018–2024 · 11 top CS/AI conferences

Achievements

🌉 Interdisciplinary Bridge 🌈 Renaissance Researcher (6) 🏃 Academic Marathon (6) 🌍 Conference Polyglot (11) 🗺️ Taxonomy Completionist (43)

🧭 Keyword Pioneer 🐣 Hot Topic Early Bird 🤝 Dynamic Duo (10) 🧬 Topic Evolution 💎 Century Club (23) ⚡ Prolific Year (6) 🗃️ Keyword Collector (92) 🔥 Unstoppable (7) 🚀 Conference Pioneer

Conferences

ACL (4) EMNLP (3) ICCV (3) CVPR (2) ECCV (2) ICLR (2) IJCNLP (2) NIPS (2) ICML (1) INTERSPEECH (1) NAACL (1)

Top co-authors

Hu Xu (10) Christoph Feichtenhofer (10) Florian Metze (8) Gargi Ghosh (7) Luke Zettlemoyer (7) Shang-Wen Li (5) Alexander Hauptmann (4) Xiaojun Chang (4) Saining Xie (4) Haoqi Fan (3)

Keywords

contrastive learning (4) self-supervised learning (4) masked autoencoder (3) multilingual multimodal (3) zero-shot learning (3) multimodal learning (2) image classification (2) image captioning (2) image retrieval (2) object detection (2) representation learning (2) vision-language model (2) attention diversity (2) zero-shot classification (2) multi-head attention (2) visual-semantic embedding (2) speech synthesis (1) attention mechanism (1) efficient training (1) video recognition (1)

Papers

Demystifying CLIP Data ICLR 2024
MoDE: CLIP Data Experts via Clustering CVPR 2024
Altogether: Image Captioning via Re-aligning Alt-text EMNLP 2024
Self-Supervised Audio-Visual Soundscape Stylization ECCV 2024
VoiceCraft: Zero-Shot Speech Editing and Text-to-Speech in the Wild ACL 2024
Hiera: A Hierarchical Vision Transformer without the Bells-and-Whistles ICML 2023
STMT: A Spatial-Temporal Mesh Transformer for MoCap-Based Action Recognition CVPR 2023
CiT: Curation in Training for Effective Vision-Language Data ICCV 2023
Diffusion Models as Masked Autoencoders ICCV 2023
Generating Hashtags for Short-form Videos with Guided Signals ACL 2023
MAViL: Masked Audio-Video Learners NIPS 2023
Masked Autoencoders that Listen NIPS 2022
AudioTagging Done Right: 2nd comparison of deep learning methods for environmental sound classification INTERSPEECH 2022
Multilingual Multimodal Pre-training for Zero-Shot Cross-Lingual Transfer of Vision-Language Models NAACL 2021
VLM: Task-agnostic Video-Language Model Pre-training for Video Understanding ACL 2021
VideoCLIP: Contrastive Pre-training for Zero-shot Video-Text Understanding EMNLP 2021
Space-Time Crop & Attend: Improving Cross-Modal Video Representation Learning ICCV 2021
Support-set bottlenecks for video-text representation learning ICLR 2021
VLM: Task-agnostic Video-Language Model Pre-training for Video Understanding IJCNLP 2021
Unsupervised Multimodal Neural Machine Translation with Pseudo Visual Pivoting ACL 2020
Multi-Head Attention with Diversity for Learning Grounded Multilingual Multimodal Representations IJCNLP 2019
Multi-Head Attention with Diversity for Learning Grounded Multilingual Multimodal Representations EMNLP 2019
RCAA: Relational Context-Aware Agents for Person Search ECCV 2018