conftrace_

multimodal learning

4645 papers

Co-occurring keywords

large language model (13587), vision-language model (2348), visual question answering (1017), video understanding (1658), multi-modal learning (1278), contrastive learning (4032), representation learning (6206), transfer learning (5449), zero-shot learning (3650), vision language model (767)

Papers

ALCAP: Alignment-Augmented Music Captioner EMNLP 2023

ORANGE: Text-video Retrieval via Watch-time-aware Heterogeneous Graph Contrastive Learning EMNLP 2023

An Empirical Study of Multimodal Model Merging EMNLP 2023

TILFA: A Unified Framework for Text, Image, and Layout Fusion in Argument Mining EMNLP 2023

Semantists at ImageArg-2023: Exploring Cross-modal Contrastive and Ensemble Models for Multimodal Stance and Persuasiveness Classification EMNLP 2023

ArchBERT: Bi-Modal Understanding of Neural Architectures and Natural Languages EMNLP 2023

KnowComp Submission for WMT23 Sign Language Translation Task EMNLP 2023

Overview of ImageArg-2023: The First Shared Task in Multimodal Argument Mining EMNLP 2023

IUST at ImageArg: The First Shared Task in Multimodal Argument Mining EMNLP 2023

Semi-supervised multimodal coreference resolution in image narrations EMNLP 2023

DiffuVST: Narrating Fictional Scenes with Global-History-Guided Denoising Models EMNLP 2023

Natural Disaster Tweets Classification Using Multimodal Data EMNLP 2023

IC3: Image Captioning by Committee Consensus EMNLP 2023

GazeVQA: A Video Question Answering Dataset for Multiview Eye-Gaze Task-Oriented Collaborations EMNLP 2023

ChatEdit: Towards Multi-turn Interactive Facial Image Editing via Dialogue EMNLP 2023

FACTIFY3M: A benchmark for multimodal fact verification with explainability through 5W Question-Answering EMNLP 2023

Not all Fake News is Written: A Dataset and Analysis of Misleading Video Headlines EMNLP 2023

VKIE: The Application of Key Information Extraction on Video Text EMNLP 2023

ImageNetVC: Zero- and Few-Shot Visual Commonsense Evaluation on 1000 ImageNet Categories EMNLP 2023

Filling the Image Information Gap for VQA: Prompting Large Language Models to Proactively Ask Questions EMNLP 2023

Towards Zero-shot Relation Extraction in Web Mining: A Multimodal Approach with Relative XML Path EMNLP 2023

Text-guided 3D Human Generation from 2D Collections EMNLP 2023

Controllable Chest X-Ray Report Generation from Longitudinal Representations EMNLP 2023

Vita-CLIP: Video and Text Adaptive CLIP via Multimodal Prompting CVPR 2023

Advancing Visual Grounding With Scene Knowledge: Benchmark and Method CVPR 2023