Multi-Modal Learning

3194 directly classified papers

Papers

Comprehending and Ordering Semantics for Image Captioning CVPR 2022

Learning Program Representations for Food Images and Cooking Recipes CVPR 2022

MuKEA: Multimodal Knowledge Extraction and Accumulation for Knowledge-Based Visual Question Answering CVPR 2022

ScanQA: 3D Question Answering for Spatial Scene Understanding CVPR 2022

Dynamic MLP for Fine-Grained Image Classification by Leveraging Geographical and Temporal Information CVPR 2022

Deep Safe Multi-View Clustering: Reducing the Risk of Clustering Performance Degradation Caused by View Increase CVPR 2022

Interact Before Align: Leveraging Cross-Modal Knowledge for Domain Adaptive Action Recognition CVPR 2022

Vision-Language Pre-Training for Boosting Scene Text Detectors CVPR 2022

Multi-View Depth Estimation by Fusing Single-View Depth Probability With Multi-View Geometry CVPR 2022

Audio-Visual Speech Codecs: Rethinking Audio-Visual Speech Enhancement by Re-Synthesis CVPR 2022

Balanced Multimodal Learning via On-the-Fly Gradient Modulation CVPR 2022

FLAVA: A Foundational Language and Vision Alignment Model CVPR 2022

Learning Hierarchical Cross-Modal Association for Co-Speech Gesture Generation CVPR 2022

Open-Vocabulary Instance Segmentation via Robust Cross-Modal Pseudo-Labeling CVPR 2022

Towards Implicit Text-Guided 3D Shape Generation CVPR 2022

Cross Modal Retrieval With Querybank Normalisation CVPR 2022

Align and Prompt: Video-and-Language Pre-Training With Entity Prompts CVPR 2022

3DJCG: A Unified Framework for Joint Dense Captioning and Visual Grounding on 3D Point Clouds CVPR 2022

Sub-Word Level Lip Reading With Visual Attention CVPR 2022

LiT: Zero-Shot Transfer With Locked-Image Text Tuning CVPR 2022

End-to-End Generative Pretraining for Multimodal Video Captioning CVPR 2022

ZeroCap: Zero-Shot Image-to-Text Generation for Visual-Semantic Arithmetic CVPR 2022

CALM: Contrastive Cross-modal Speaking Style Modeling for Expressive Text-to-Speech Synthesis INTERSPEECH 2022

DeToxy: A Large-Scale Multimodal Dataset for Toxicity Classification in Spoken Utterances INTERSPEECH 2022

Speaker recognition-assisted robust audio deepfake detection INTERSPEECH 2022