conftrace_

multimodal learning

4622 papers

Also known as

VLM VLLM MM VLA MLLMS MLM MML MULLM LMM MLLM MMT

Co-occurring keywords

large language model (12755), vision-language model (2235), visual question answering (1000), video understanding (1647), multi-modal learning (1276), contrastive learning (3979), representation learning (6174), transfer learning (5442), zero-shot learning (3637), vision language model (752)

Papers

Towards Surveillance Video-and-Language Understanding: New Dataset, Baselines, and Challenges CVPR 2024

Multimodal Industrial Anomaly Detection by Crossmodal Feature Mapping CVPR 2024

VCoder: Versatile Vision Encoders for Multimodal Large Language Models CVPR 2024

Omni-SMoLA: Boosting Generalist Multimodal Models with Soft Mixture of Low-rank Experts CVPR 2024

On Scaling Up a Multilingual Vision and Language Model CVPR 2024

On the Robustness of Language Guidance for Low-Level Vision Tasks: Findings from Depth Estimation CVPR 2024

SkySense: A Multi-Modal Remote Sensing Foundation Model Towards Universal Interpretation for Earth Observation Imagery CVPR 2024

MoPE-CLIP: Structured Pruning for Efficient Vision-Language Models with Module-wise Pruning Error Metric CVPR 2024

MAPLM: A Real-World Large-Scale Vision-Language Benchmark for Map and Traffic Scene Understanding CVPR 2024

Exploring the Transferability of Visual Prompting for Multimodal Large Language Models CVPR 2024

LLaFS: When Large Language Models Meet Few-Shot Segmentation CVPR 2024

Knowledge-Enhanced Dual-stream Zero-shot Composed Image Retrieval CVPR 2024

SportsMetrics: Blending Text and Numerical Data to Understand Information Fusion in LLMs ACL 2024

VTQA: Visual Text Question Answering via Entity Alignment and Cross-Media Reasoning CVPR 2024

MMAD: Multi-modal Movie Audio Description COLING 2024

Breaking Barriers of System Heterogeneity: Straggler-Tolerant Multimodal Federated Learning via Knowledge Distillation IJCAI 2024

Towards Answering Health-related Questions from Medical Videos: Datasets and Approaches COLING 2024

CrossMAE: Cross-Modality Masked Autoencoders for Region-Aware Audio-Visual Pre-Training CVPR 2024

OmniViD: A Generative Framework for Universal Video Understanding CVPR 2024

ChatPose: Chatting about 3D Human Pose CVPR 2024

EMAGE: Towards Unified Holistic Co-Speech Gesture Generation via Expressive Masked Audio Gesture Modeling CVPR 2024

VSCode: General Visual Salient and Camouflaged Object Detection with 2D Prompt Learning CVPR 2024

Fusion from a Distributional Perspective: A Unified Symbiotic Diffusion Framework for Any Multisource Remote Sensing Data Classification IJCAI 2024

Towards More Unified In-context Visual Understanding CVPR 2024

Discriminative Probing and Tuning for Text-to-Image Generation CVPR 2024