Multi-Modal Learning

3194 directly classified papers

Papers

ExACT: Language-guided Conceptual Reasoning and Uncertainty Estimation for Event-based Action Recognition and More CVPR 2024

Identification of Necessary Semantic Undertakers in the Causal View for Image-Text Matching AAAI 2024

Video-GroundingDINO: Towards Open-Vocabulary Spatio-Temporal Video Grounding CVPR 2024

Semantic-Aware Video Representation for Few-Shot Action Recognition WACV 2024

When Visual Grounding Meets Gigapixel-level Large-scale Scenes: Benchmark and Approach CVPR 2024

Vision-Language Pre-training with Object Contrastive Learning for 3D Scene Understanding AAAI 2024

Prompt-Driven Referring Image Segmentation with Instance Contrasting CVPR 2024

Have We Ever Encountered This Before? Retrieving Out-of-Distribution Road Obstacles From Driving Scenes WACV 2024

Perceiving Longer Sequences With Bi-Directional Cross-Attention Transformers NIPS 2024

LiT: Unifying LiDAR "Languages" with LiDAR Translator NIPS 2024

CALVIN: Improved Contextual Video Captioning via Instruction Tuning NIPS 2024

Extending Multi-modal Contrastive Representations NIPS 2024

Enhancing Multi-View Pedestrian Detection Through Generalized 3D Feature Pulling WACV 2024

Enhancing Neural Radiance Fields with Adaptive Multi-Exposure Fusion: A Bilevel Optimization Approach for Novel View Synthesis AAAI 2024

Multilingual Diversity Improves Vision-Language Representations NIPS 2024

Octopus: A Multi-modal LLM with Parallel Recognition and Sequential Understanding NIPS 2024

Enhancing Feature Diversity Boosts Channel-Adaptive Vision Transformers NIPS 2024

CoPL: Contextual Prompt Learning for Vision-Language Understanding AAAI 2024

WhodunitBench: Evaluating Large Multimodal Agents via Murder Mystery Games NIPS 2024

HENASY: Learning to Assemble Scene-Entities for Interpretable Egocentric Video-Language Model NIPS 2024

Who Evaluates the Evaluations? Objectively Scoring Text-to-Image Prompt Coherence Metrics with T2IScoreScore (TS2) NIPS 2024

Frequency Spectrum Is More Effective for Multimodal Representation and Fusion: A Multimodal Spectrum Rumor Detector AAAI 2024

AFBench: A Large-scale Benchmark for Airfoil Design NIPS 2024

Customized Multiple Clustering via Multi-Modal Subspace Proxy Learning NIPS 2024

RaVL: Discovering and Mitigating Spurious Correlations in Fine-Tuned Vision-Language Models NIPS 2024