Yong Man Ro

45 papers · 2018–2026 · 9 conferences · across top CS/AI conferences

Achievements

🌍 Conference Polyglot (9) 🧭 Keyword Pioneer 🌉 Interdisciplinary Bridge 🌈 Renaissance Researcher (5) 🏃 Academic Marathon (7)

🐝 Cross-Pollinator (4) 🗺️ Taxonomy Completionist (85) 🤝 Dynamic Duo (14) 🔬 Deep Specialist (18) 🏆 Keyword Champion (3) 🧬 Topic Evolution 🚀 Conference Pioneer 🗃️ Keyword Collector (223) 📈 Trend Setter ⚡ Prolific Year (10) 🔥 Unstoppable (8) ❓ The Questioner (2) 💎 Century Club (44)

Conferences

AAAI (11) CVPR (10) ECCV (6) ICCV (6) NIPS (4) EMNLP (3) ACL (2) INTERSPEECH (2) ICML (1)

Top co-authors

Minsu Kim (14) Byung-Kwan Lee (9) Chae Won Kim (7) Joanna Hong (7) Junho Kim (7) Sangmin Lee (7) Jeong Hun Yeo (6) Jeongsoo Choi (6) Hak Gu Kim (6) Se Jin Park (5)

Research topics

Applications (1)

Keywords

multimodal learning (7) lip reading (7) large language model (5) audio-visual speech recognition (4) memory network (4) visual speech recognition (3) model compression (3) vision language model (3) pedestrian detection (3) representation learning (3) visual instruction tuning (3) adversarial robustness (3) object detection (3) causal inference (3) diffusion model (2) large multimodal model (2) visual context (2) audio-visual learning (2) multispectral imaging (2) speech synthesis (2)

Papers

Emotion-Coherent Reasoning for Multimodal LLMs via Emotional Rationale Verifier AAAI 2026
MMS-LLaMA: Efficient LLM-based Audio-Visual Speech Recognition with Minimal Multimodal Speech Tokens ACL 2025
Personalized Lip Reading: Adapting to Your Unique Lip Movements with Vision and Language AAAI 2025
Long-Form Speech Generation with Spoken Language Models ICML 2025
VLsI: Verbalized Layers-to-Interactions from Large to Small Vision Language Models CVPR 2025
SALOVA: Segment-Augmented Long Video Assistant for Targeted Retrieval and Routing in Long-Form Video Analysis CVPR 2025
Zero-AVSR: Zero-Shot Audio-Visual Speech Recognition with LLMs by Learning Language-Agnostic Speech Representations ICCV 2025
What if...?: Thinking Counterfactual Keywords Helps to Mitigate Hallucination in Large Multi-modal Models EMNLP 2024
Meteor: Mamba-based Traversal of Rationale for Large Language and Vision Models NIPS 2024
CODE: Contrasting Self-generated Description to Combat Hallucination in Large Multi-modal Models NIPS 2024
Improving Open Set Recognition via Visual Prompts Distilled from Common-Sense Knowledge AAAI 2024
CoLLaVO: Crayon Large Language and Vision mOdel ACL 2024
AV2AV: Direct Audio-Visual Speech to Audio-Visual Speech Translation with Unified Audio-Visual Speech Representation CVPR 2024
Causal Mode Multiplexer: A Novel Framework for Unbiased Multispectral Pedestrian Detection CVPR 2024
MoAI: Mixture of All Intelligence for Large Language and Vision Models ECCV 2024
TroL: Traversal of Layers for Large Language and Vision Models EMNLP 2024
Where Visual Speech Meets Language: VSP-LLM Framework for Efficient and Context-Aware Visual Speech Processing EMNLP 2024
Intelligible Lip-to-Speech Synthesis with Speech Units INTERSPEECH 2023
Watch or Listen: Robust Audio-Visual Speech Recognition With Visual Corruption Modeling and Reliability Scoring CVPR 2023
Demystifying Causal Features on Adversarial Examples and Causal Inoculation for Robust Network by Adversarial Instrumental Variable Regression CVPR 2023
Mitigating Adversarial Vulnerability through Causal Parameter Estimation by Adversarial Double Machine Learning ICCV 2023
Lip Reading for Low-resource Languages by Learning and Combining General Speech Knowledge and Language-specific Knowledge ICCV 2023
Multispectral Invisible Coating: Laminated Visible-Thermal Physical Attack against Multispectral Object Detectors Using Transparent Low-E Films AAAI 2023
Deep Visual Forced Alignment: Learning to Align Transcription with Talking Face Video AAAI 2023
DiffV2S: Diffusion-Based Video-to-Speech Synthesis with Vision-Guided Speaker Embedding ICCV 2023
SyncTalkFace: Talking Face Generation with Precise Lip-Syncing via Audio-Lip Memory AAAI 2022
Visual Context-driven Audio Feature Enhancement for Robust End-to-End Audio-Visual Speech Recognition INTERSPEECH 2022
Weakly Paired Associative Learning for Sound and Image Representations via Bimodal Associative Memory CVPR 2022
Masking Adversarial Damage: Finding Adversarial Saliency for Robust and Sparse Network CVPR 2022
Audio-Visual Mismatch-Aware Video Retrieval via Association and Adjustment ECCV 2022
Distinguishing Homophenes Using Multi-Head Visual-Audio Memory for Lip Reading AAAI 2022
VisageSynTalk: Unseen Speaker Video-to-Speech Synthesis via Speech-Visage Feature Selection ECCV 2022
Speaker-Adaptive Lip Reading with User-Dependent Padding ECCV 2022
Towards Versatile Pedestrian Detector with Multisensory-Matching and Multispectral Recalling Memory AAAI 2022
Video Prediction Recalling Long-Term Motion Context via Memory Alignment Learning CVPR 2021
Distilling Robust and Non-Robust Features in Adversarial Examples by Information Bottleneck NIPS 2021
Towards a Better Understanding of VR Sickness: Physical Symptom Prediction for VR Contents AAAI 2021
Visual Comfort Aware-Reinforcement Learning for Depth Adjustment of Stereoscopic 3D Images AAAI 2021
Lip to Speech Synthesis with Visual Context Attentional GAN NIPS 2021
Multi-Modality Associative Bridging Through Memory: Speech Sound Recollected From Face Video ICCV 2021
Robust Small-Scale Pedestrian Detection With Cued Recall via Memory Learning ICCV 2021
SACA Net: Cybersickness Assessment of Individual Viewers for VR Content via Graph-based Symptom Relation Embedding ECCV 2020
Structure Boundary Preserving Segmentation for Medical Image With Ambiguous Boundary CVPR 2020
Mode Variational LSTM Robust to Unseen Modes of Variation: Application to Facial Expression Recognition AAAI 2019
Facial Dynamics Interpreter Network: What are the Important Relations between Local Dynamics for Facial Trait Estimation? ECCV 2018