multimodal learning

4622 papers

Also known as

VLM, VLLM, MM, VLA, MLLMS, MLM, MML, MULLM, LMM, MLLM, MMT

Co-occurring keywords

large language model (12755), vision-language model (2235), visual question answering (1000), video understanding (1647), multi-modal learning (1276), contrastive learning (3979), representation learning (6174), transfer learning (5442), zero-shot learning (3637), vision language model (752)

Papers

Speech and Text Analysis for Multimodal Addressee Detection in Human-Human-Computer Interaction INTERSPEECH 2017

Dual Track Multimodal Automatic Learning through Human-Robot Interaction IJCAI 2017

Automatic Generation of Grounded Visual Questions IJCAI 2017

Multimodal Storytelling via Generative Adversarial Imitation Learning IJCAI 2017

Multimodal Learning and Reasoning for Visual Question Answering NIPS 2017

Bimodal Recurrent Neural Network for Audiovisual Voice Activity Detection INTERSPEECH 2017

Procedural Text Generation from an Execution Video IJCNLP 2017

NTCD-TIMIT: A New Database and Baseline for Noise-Robust Audio-Visual Speech Recognition INTERSPEECH 2017

Hierarchical Question-Image Co-Attention for Visual Question Answering NIPS 2016

Visual Question Answering with Question Representation Update (QRU) NIPS 2016

Parameter estimation of Japanese predicate argument structure analysis model using eye gaze information COLING 2016

Diagnostic Prediction Using Discomfort Drawings with IBTM MLHC 2016

Combining Heterogeneous User Generated Data to Sense Well-being COLING 2016

Exploring Collections of Multimedia Archives Through Innovative Interfaces in the Context of Digital Humanities INTERSPEECH 2016

Dynamic Stream Weighting for Turbo-Decoding-Based Audiovisual ASR INTERSPEECH 2016

Predicting Affective Dimensions Based on Self Assessed Depression Severity INTERSPEECH 2016

Where to Look: Focus Regions for Visual Question Answering CVPR 2016

Training and Evaluating Multimodal Word Embeddings with Large-scale Web Annotated Images NIPS 2016

Introducing the Turbo-Twin-HMM for Audio-Visual Speech Enhancement INTERSPEECH 2016

MSR-VTT: A Large Video Description Dataset for Bridging Video and Language CVPR 2016

Combining CNN and BLSTM to Extract Textual and Acoustic Features for Recognizing Stances in Mandarin Ideological Debate Competition INTERSPEECH 2016

Automatic Genre and Show Identification of Broadcast Media INTERSPEECH 2016

Multimodal Residual Learning for Visual QA NIPS 2016

A Neural Network Approach for Knowledge-Driven Response Generation COLING 2016

Challenges in multimodal gesture recognition JMLR 2016