Multi-Modal Learning

1213 directly classified papers

Papers

EASUM: Enhancing Affective State Understanding Through Joint Sentiment and Emotion Modeling for Multimodal Tasks WACV 2024

The Interspeech 2024 TAUKADIAL Challenge: Multilingual Mild Cognitive Impairment Detection with Multimodal Approach INTERSPEECH 2024

ShapeWalk: Compositional Shape Editing Through Language-Guided Chains CVPR 2024

A Cross-Attention Layer coupled with Multimodal Fusion Methods for Recognizing Depression from Spontaneous Speech INTERSPEECH 2024

Multi-Source Domain Adaptation for Object Detection With Prototype-Based Mean Teacher WACV 2024

HHMR: Holistic Hand Mesh Recovery by Enhancing the Multimodal Controllability of Graph Diffusion Models CVPR 2024

OmniVec: Learning Robust Representations With Cross Modal Sharing WACV 2024

SRTube: Video-Language Pre-Training with Action-Centric Video Tube Features and Semantic Role Labeling CVPR 2024

HoloVIC: Large-scale Dataset and Benchmark for Multi-Sensor Holographic Intersection and Vehicle-Infrastructure Cooperative CVPR 2024

Can CLIP Help Sound Source Localization? WACV 2024

Eyes Wide Shut? Exploring the Visual Shortcomings of Multimodal LLMs CVPR 2024

Language Models as Black-Box Optimizers for Vision-Language Models CVPR 2024

Audio-text Retrieval with Transformer-based Hierarchical Alignment and Disentangled Cross-modal Representation INTERSPEECH 2024

Tackling Data Bias in MUSIC-AVQA: Crafting a Balanced Dataset for Unbiased Question-Answering WACV 2024

MIVC: Multiple Instance Visual Component for Visual-Language Models WACV 2024

LLMs are Good Action Recognizers CVPR 2024

Discovering Syntactic Interaction Clues for Human-Object Interaction Detection CVPR 2024

OneTracker: Unifying Visual Object Tracking with Foundation Models and Efficient Tuning CVPR 2024

Question Aware Vision Transformer for Multimodal Reasoning CVPR 2024

Improving Vision-and-Language Reasoning via Spatial Relations Modeling WACV 2024

InteractDiffusion: Interaction Control in Text-to-Image Diffusion Models CVPR 2024

CAMOT: Camera Angle-Aware Multi-Object Tracking WACV 2024

The Audio-Visual Conversational Graph: From an Egocentric-Exocentric Perspective CVPR 2024

Bridging the Gap: A Unified Video Comprehension Framework for Moment Retrieval and Highlight Detection CVPR 2024

Investigating the Role of Attribute Context in Vision-Language Models for Object Recognition and Detection WACV 2024