Multi-Modal Learning

3194 directly classified papers

Papers

Exploring Vision Transformers for 3D Human Motion-Language Models with Motion Patches CVPR 2024

Prompt-Driven Referring Image Segmentation with Instance Contrasting CVPR 2024

L4D-Track: Language-to-4D Modeling Towards 6-DoF Tracking and Shape Reconstruction in 3D Point Cloud Stream CVPR 2024

Omni-SMoLA: Boosting Generalist Multimodal Models with Soft Mixture of Low-rank Experts CVPR 2024

CoVR: Learning Composed Video Retrieval from Web Video Captions AAAI 2024

Structural Information Guided Multimodal Pre-training for Vehicle-Centric Perception AAAI 2024

Unraveling Instance Associations: A Closer Look for Audio-Visual Segmentation CVPR 2024

SED: A Simple Encoder-Decoder for Open-Vocabulary Semantic Segmentation CVPR 2024

Open-Vocabulary Video Relation Extraction AAAI 2024

Revisiting motion information for RGB-Event tracking with MOT philosophy NeurIPS 2024

PromptAD: Zero-Shot Anomaly Detection Using Text Prompts WACV 2024

Jack of All Tasks Master of Many: Designing General-Purpose Coarse-to-Fine Vision-Language Model CVPR 2024

TOP-ReID: Multi-Spectral Object Re-identification with Token Permutation AAAI 2024

Sound3DVDet: 3D Sound Source Detection Using Multiview Microphone Array and RGB Images WACV 2024

Seeing the Unseen: Visual Common Sense for Semantic Placement CVPR 2024

RegionGPT: Towards Region Understanding Vision Language Model CVPR 2024

Simple Image-Level Classification Improves Open-Vocabulary Object Detection AAAI 2024

EgoThink: Evaluating First-Person Perspective Thinking Capability of Vision-Language Models CVPR 2024

TeMO: Towards Text-Driven 3D Stylization for Multi-Object Meshes CVPR 2024

Prompt Highlighter: Interactive Control for Multi-Modal LLMs CVPR 2024

Fewer Steps, Better Performance: Efficient Cross-Modal Clip Trimming for Video Moment Retrieval Using Language AAAI 2024

DiffSHEG: A Diffusion-Based Approach for Real-Time Speech-driven Holistic 3D Expression and Gesture Generation CVPR 2024

When Visual Grounding Meets Gigapixel-level Large-scale Scenes: Benchmark and Approach CVPR 2024

DisCo: Disentangled Control for Realistic Human Dance Generation CVPR 2024

Towards Efficient and Effective Text-to-Video Retrieval with Coarse-to-Fine Visual Representation Learning AAAI 2024