conftrace_

multimodal learning

4645 papers

Co-occurring keywords

large language model (13587)
vision-language model (2348)
visual question answering (1017)
video understanding (1658)
multi-modal learning (1278)
contrastive learning (4032)
representation learning (6206)
transfer learning (5449)
zero-shot learning (3650)
vision language model (767)

Papers

Is CLIP ideal? No. Can we fix it? Yes! ICCV 2025

IGD: Instructional Graphic Design with Multimodal Layer Generation ICCV 2025

Perceive, Understand and Restore: Real-World Image Super-Resolution with Autoregressive Multimodal Generative Models ICCV 2025

AMDANet: Attention-Driven Multi-Perspective Discrepancy Alignment for RGB-Infrared Image Fusion and Segmentation ICCV 2025

DeRIS: Decoupling Perception and Cognition for Enhanced Referring Image Segmentation through Loopback Synergy ICCV 2025

Completing 3D Partial Assemblies with View-Consistent 2D-3D Correspondence ICCV 2025

Streaming VideoLLMs for Real-Time Procedural Video Understanding ICCV 2025

Perspective-Aware Reasoning in Vision-Language Models via Mental Imagery Simulation ICCV 2025

Vamba: Understanding Hour-Long Videos with Hybrid Mamba-Transformers ICCV 2025

How Do Optical Flow and Textual Prompts Collaborate to Assist in Audio-Visual Semantic Segmentation? ICCV 2025

OURO: A Self-Bootstrapped Framework for Enhancing Multimodal Scene Understanding ICCV 2025

Bridging the Gap between Brain and Machine in Interpreting Visual Semantics: Towards Self-adaptive Brain-to-Text Decoding ICCV 2025

Exploiting Frequency Dynamics for Enhanced Multimodal Event-based Action Recognition ICCV 2025

Lumina-Image 2.0: A Unified and Efficient Image Generative Framework ICCV 2025

ATCTrack: Aligning Target-Context Cues with Dynamic Target States for Robust Vision-Language Tracking ICCV 2025

MMGeo: Multimodal Compositional Geo-Localization for UAVs ICCV 2025

Clink! Chop! Thud! - Learning Object Sounds from Real-World Interactions ICCV 2025

TerraMind: Large-Scale Generative Multimodality for Earth Observation ICCV 2025

Learning Beyond Still Frames: Scaling Vision-Language Models with Video ICCV 2025

Hybrid-grained Feature Aggregation with Coarse-to-fine Language Guidance for Self-supervised Monocular Depth Estimation ICCV 2025

Seeing 3D Through 2D Lenses: 3D Few-Shot Class-Incremental Learning via Cross-Modal Geometric Rectification ICCV 2025

Scaling Omni-modal Pretraining with Multimodal Context: Advancing Universal Representation Learning Across Modalities ICCV 2025

FedMVP: Federated Multimodal Visual Prompt Tuning for Vision-Language Models ICCV 2025

Aligning Information Capacity Between Vision and Language via Dense-to-Sparse Feature Distillation for Image-Text Matching ICCV 2025

VideoRAG: Retrieval-Augmented Generation over Video Corpus ACL 2025