Multimodal Learning

1257 directly classified papers

Papers

DRAG: Dynamic Region-Aware GCN for Privacy-Leaking Image Detection AAAI 2022

Are Vision-Language Transformers Learning Multimodal Representations? A Probing Perspective AAAI 2022

Understanding Attention for Vision-and-Language Tasks COLING 2022

MuKEA: Multimodal Knowledge Extraction and Accumulation for Knowledge-Based Visual Question Answering CVPR 2022

Learning Program Representations for Food Images and Cooking Recipes CVPR 2022

Why is Winoground Hard? Investigating Failures in Visuolinguistic Compositionality EMNLP 2022

Language Models with Image Descriptors are Strong Few-Shot Video-Language Learners NIPS 2022

ELEVATER: A Benchmark and Toolkit for Evaluating Language-Augmented Visual Models NIPS 2022

Multimodal Contrastive Learning with LIMoE: the Language-Image Mixture of Experts NIPS 2022

I2DFormer: Learning Image to Document Attention for Zero-Shot Image Classification NIPS 2022

TaiSu: A 166M Large-scale High-Quality Dataset for Chinese Vision-Language Pre-training NIPS 2022

CAESAR: An Embodied Simulator for Generating Multimodal Referring Expression Datasets NIPS 2022

SAMURAI: Shape And Material from Unconstrained Real-world Arbitrary Image collections NIPS 2022

MM-GATBT: Enriching Multimodal Representation Using Graph Attention Network NAACL 2022

Analysing the Correlation between Lexical Ambiguity and Translation Quality in a Multimodal Setting using WordNet NAACL 2022

Beyond Emotion: A Multi-Modal Dataset for Human Desire Understanding NAACL 2022

Open-Domain, Content-Based, Multi-Modal Fact-Checking of Out-of-Context Images via Online Resources CVPR 2022

3MASSIV: Multilingual, Multimodal and Multi-Aspect Dataset of Social Media Short Videos CVPR 2022

A Proposal-Based Paradigm for Self-Supervised Sound Source Localization in Videos CVPR 2022

MERLOT Reserve: Neural Script Knowledge Through Vision and Language and Sound CVPR 2022

Region-Aware Face Swapping CVPR 2022

NLX-GPT: A Model for Natural Language Explanations in Vision and Vision-Language Tasks CVPR 2022

An Empirical Study of Training End-to-End Vision-and-Language Transformers CVPR 2022

OmniFusion: 360 Monocular Depth Estimation via Geometry-Aware Fusion CVPR 2022

Multi-Grained Vision Language Pre-Training: Aligning Texts with Visual Concepts ICML 2022