Multi-Modal Learning

3194 directly classified papers

Papers

Text-to-Song: Towards Controllable Music Generation Incorporating Vocal and Accompaniment ACL 2024

ProtT3: Protein-to-Text Generation for Text-based Protein Understanding ACL 2024

Motion Deblurring via Spatial-Temporal Collaboration of Frames and Events AAAI 2024

STDiff: Spatio-Temporal Diffusion for Continuous Stochastic Video Prediction AAAI 2024

MM-Narrator: Narrating Long-form Videos with Multimodal In-Context Learning CVPR 2024

CLIM: Contrastive Language-Image Mosaic for Region Representation AAAI 2024

CaMML: Context-Aware Multimodal Learner for Large Models ACL 2024

Direction-Aware Video Demoiréing with Temporal-Guided Bilateral Learning AAAI 2024

Deep Visual-Genetic Biometrics for Taxonomic Classification of Rare Species WACV 2024

Unity in Diversity: Collaborative Pre-training Across Multimodal Medical Sources ACL 2024

Segment beyond View: Handling Partially Missing Modality for Audio-Visual Semantic Segmentation AAAI 2024

Chain of Generation: Multi-Modal Gesture Synthesis via Cascaded Conditional Control AAAI 2024

Learning to Segment Referred Objects from Narrated Egocentric Videos CVPR 2024

GMMFormer: Gaussian-Mixture-Model Based Transformer for Efficient Partially Relevant Video Retrieval AAAI 2024

VISTA: Visualized Text Embedding For Universal Multi-Modal Retrieval ACL 2024

Heterogeneous Test-Time Training for Multi-Modal Person Re-identification AAAI 2024

CAD-SIGNet: CAD Language Inference from Point Clouds using Layer-wise Sketch Instance Guided Attention CVPR 2024

Synergetic Event Understanding: A Collaborative Approach to Cross-Document Event Coreference Resolution with Large Language Models ACL 2024

Introducing GenCeption for Multimodal LLM Benchmarking: You May Bypass Annotations NAACL 2024

3D-STMN: Dependency-Driven Superpoint-Text Matching Network for End-to-End 3D Referring Expression Segmentation AAAI 2024

DiffSal: Joint Audio and Video Learning for Diffusion Saliency Prediction CVPR 2024

Referred by Multi-Modality: A Unified Temporal Transformer for Video Object Segmentation AAAI 2024

Open-Vocabulary Video Relation Extraction AAAI 2024

Exploring Chain-of-Thought for Multi-modal Metaphor Detection ACL 2024

CoVR: Learning Composed Video Retrieval from Web Video Captions AAAI 2024