Multimodal Learning

1257 directly classified papers

Papers

Visual Storytelling with Question-Answer Plans EMNLP 2023

Mulan: A Multi-Level Alignment Model for Video Question Answering EMNLP 2023

COSMOS: Catching Out-of-Context Image Misuse Using Self-Supervised Learning AAAI 2023

Rethinking Alignment and Uniformity in Unsupervised Image Semantic Segmentation AAAI 2023

MVCINN: Multi-View Diabetic Retinopathy Detection Using a Deep Cross-Interaction Neural Network AAAI 2023

CF-ViT: A General Coarse-to-Fine Method for Vision Transformer AAAI 2023

Learning Deep Hierarchical Features with Spatial Regularization for One-Class Facial Expression Recognition AAAI 2023

Interactive Concept Bottleneck Models AAAI 2023

DINet: Deformation Inpainting Network for Realistic Face Visually Dubbing on High Resolution Video AAAI 2023

Multi-Modal Knowledge Hypergraph for Diverse Image Retrieval AAAI 2023

Video-Text Pre-training with Learned Regions for Retrieval AAAI 2023

Efficient End-to-End Video Question Answering with Pyramidal Multimodal Transformer AAAI 2023

Cross-Modality Earth Mover’s Distance for Visible Thermal Person Re-identification AAAI 2023

Learning Polysemantic Spoof Trace: A Multi-Modal Disentanglement Network for Face Anti-spoofing AAAI 2023

CALIP: Zero-Shot Enhancement of CLIP with Parameter-Free Attention AAAI 2023

Predict and Use: Harnessing Predicted Gaze to Improve Multimodal Sarcasm Detection EMNLP 2023

Let’s Think Frame by Frame with VIP: A Video Infilling and Prediction Dataset for Evaluating Video Chain-of-Thought EMNLP 2023

Multitask Multimodal Prompted Training for Interactive Embodied Task Completion EMNLP 2023

Referring Expression Comprehension Using Language Adaptive Inference AAAI 2023

Debiased Fine-Tuning for Vision-Language Models by Prompt Regularization AAAI 2023

Visually Grounded Commonsense Knowledge Acquisition AAAI 2023

Improving the Cross-Lingual Generalisation in Visual Question Answering AAAI 2023

Coarse-to-Fine Contrastive Learning in Image-Text-Graph Space for Improved Vision-Language Compositionality EMNLP 2023

From Wrong To Right: A Recursive Approach Towards Vision-Language Explanation EMNLP 2023

Hallucination Detection for Grounded Instruction Generation EMNLP 2023