Multimodal Learning

1257 directly classified papers

Papers

MoReVQA: Exploring Modular Reasoning Models for Video Question Answering CVPR 2024

Bring Event into RGB and LiDAR: Hierarchical Visual-Motion Fusion for Scene Flow CVPR 2024

EmbodiedScan: A Holistic Multi-Modal 3D Perception Suite Towards Embodied AI CVPR 2024

On the Robustness of Large Multimodal Models Against Image Adversarial Attacks CVPR 2024

SoundingActions: Learning How Actions Sound from Narrated Egocentric Videos CVPR 2024

Holoported Characters: Real-time Free-viewpoint Rendering of Humans from Sparse RGB Cameras CVPR 2024

AlignMiF: Geometry-Aligned Multimodal Implicit Field for LiDAR-Camera Joint Synthesis CVPR 2024

VideoCon: Robust Video-Language Alignment via Contrast Captions CVPR 2024

Few-Shot Adversarial Prompt Learning on Vision-Language Models NIPS 2024

VRSBench: A Versatile Vision-Language Benchmark Dataset for Remote Sensing Image Understanding NIPS 2024

Eye-gaze Guided Multi-modal Alignment for Medical Representation Learning NIPS 2024

$SE(3)$ Equivariant Ray Embeddings for Implicit Multi-View Depth Estimation NIPS 2024

Beyond Euclidean: Dual-Space Representation Learning for Weakly Supervised Video Violence Detection NIPS 2024

Needle In A Multimodal Haystack NIPS 2024

Adaptive Image Quality Assessment via Teaching Large Multimodal Model to Compare NIPS 2024

Bridge the Modality and Capability Gaps in Vision-Language Model Selection NIPS 2024

Multi-Object Hallucination in Vision Language Models NIPS 2024

Cracking the Code of Juxtaposition: Can AI Models Understand the Humorous Contradictions NIPS 2024

Multi-view Masked Contrastive Representation Learning for Endoscopic Video Analysis NIPS 2024

WATT: Weight Average Test Time Adaptation of CLIP NIPS 2024

Vision-Language Models are Strong Noisy Label Detectors NIPS 2024

4M-21: An Any-to-Any Vision Model for Tens of Tasks and Modalities NIPS 2024

VideoGUI: A Benchmark for GUI Automation from Instructional Videos NIPS 2024

Pre-trained Text-to-Image Diffusion Models Are Versatile Representation Learners for Control NIPS 2024

XMask3D: Cross-modal Mask Reasoning for Open Vocabulary 3D Semantic Segmentation NIPS 2024