conftrace_

multimodal learning

4645 papers

Co-occurring keywords

large language model (13587)
vision-language model (2348)
visual question answering (1017)
video understanding (1658)
multi-modal learning (1278)
contrastive learning (4032)
representation learning (6206)
transfer learning (5449)
zero-shot learning (3650)
vision language model (767)

Papers

SEA-VQA: Southeast Asian Cultural Context Dataset For Visual Question Answering ACL 2024

MemeMind at ArAIEval Shared Task: Generative Augmentation and Feature Fusion for Multimodal Propaganda Detection in Arabic Memes through Advanced Language and Vision Models ACL 2024

AlexUNLP-MZ at ArAIEval Shared Task: Contrastive Learning, LLM Features Extraction and Multi-Objective Optimization for Arabic Multi-Modal Meme Propaganda Detection ACL 2024

TiMix: Text-Aware Image Mixing for Effective Vision-Language Pre-training AAAI 2024

Image Captioning with Multi-Context Synthetic Data AAAI 2024

A Multimodal, Multi-Task Adapting Framework for Video Action Recognition AAAI 2024

Structural Information Guided Multimodal Pre-training for Vehicle-Centric Perception AAAI 2024

Weakly Supervised Multimodal Affordance Grounding for Egocentric Images AAAI 2024

THGFormer: Time-Aware Hypergraph Learning for Multimodal Social Media Popularity Prediction (Student Abstract) AAAI 2024

Harnessing CLIP for Evidence Identification in Scientific Literature: A Multimodal Approach to Context24 Shared Task ACL 2024

Ancient Chinese Glyph Identification Powered by Radical Semantics ACL 2024

Relational Distant Supervision for Image Captioning without Image-Text Pairs AAAI 2024

HSDreport: Heart Sound Diagnosis with Echocardiography Reports EMNLP 2024

Soft Knowledge Prompt: Help External Knowledge Become a Better Teacher to Instruct LLM in Knowledge-based VQA ACL 2024

Uni-Dubbing: Zero-Shot Speech Synthesis from Visual Articulation ACL 2024

TinyChart: Efficient Chart Understanding with Program-of-Thoughts Learning and Visual Token Merging EMNLP 2024

PolCLIP: A Unified Image-Text Word Sense Disambiguation Model via Generating Multimodal Complementary Representations ACL 2024

From Sights to Insights: Towards Summarization of Multimodal Clinical Documents ACL 2024

Lexicon-Level Contrastive Visual-Grounding Improves Language Modeling ACL 2024

StyleDubber: Towards Multi-Scale Style Learning for Movie Dubbing ACL 2024

Findings of WASSA 2024 Shared Task on Empathy and Personality Detection in Interactions ACL 2024

Causal Mode Multiplexer: A Novel Framework for Unbiased Multispectral Pedestrian Detection CVPR 2024

VideoCon: Robust Video-Language Alignment via Contrast Captions CVPR 2024

JRDB-Social: A Multifaceted Robotic Dataset for Understanding of Context and Dynamics of Human Interactions Within Social Groups CVPR 2024

Making Visual Sense of Oracle Bones for You and Me CVPR 2024