Multimodal Learning

13,057 papers

Papers

Coarse Correspondences Boost Spatial-Temporal Reasoning in Multimodal Language Model CVPR 2025

On the Zero-shot Adversarial Robustness of Vision-Language Models: A Truly Zero-shot and Training-free Approach CVPR 2025

Towards General Visual-Linguistic Face Forgery Detection CVPR 2025

Movie Weaver: Tuning-Free Multi-Concept Video Personalization with Anchored Prompts CVPR 2025

LongVALE: Vision-Audio-Language-Event Benchmark Towards Time-Aware Omni-Modal Perception of Long Videos CVPR 2025

Seeing More with Less: Human-like Representations in Vision Models CVPR 2025

Modeling Thousands of Human Annotators for Generalizable Text-to-Image Person Re-identification CVPR 2025

Accelerating Multimodal Large Language Models by Searching Optimal Vision Token Reduction CVPR 2025

Mitigating Hallucinations in Large Vision-Language Models via DPO: On-Policy Data Hold the Key CVPR 2025

Localizing Events in Videos with Multimodal Queries CVPR 2025

PhysVLM: Enabling Visual Language Models to Understand Robotic Physical Reachability CVPR 2025

Language-Guided Salient Object Ranking CVPR 2025

Towards More General Video-based Deepfake Detection through Facial Component Guided Adaptation for Foundation Model CVPR 2025

SP3D: Boosting Sparsely-Supervised 3D Object Detection via Accurate Cross-Modal Semantic Prompts CVPR 2025

VoCo-LLaMA: Towards Vision Compression with Large Language Models CVPR 2025

XLRS-Bench: Could Your Multimodal LLMs Understand Extremely Large Ultra-High-Resolution Remote Sensing Imagery? CVPR 2025

BOLT: Boost Large Vision-Language Model Without Training for Long-form Video Understanding CVPR 2025

Towards All-in-One Medical Image Re-Identification CVPR 2025

SegAgent: Exploring Pixel Understanding Capabilities in MLLMs by Imitating Human Annotator Trajectories CVPR 2025

HeatFormer: A Neural Optimizer for Multiview Human Mesh Recovery CVPR 2025

ResCLIP: Residual Attention for Training-free Dense Vision-language Inference CVPR 2025

CPath-Omni: A Unified Multimodal Foundation Model for Patch and Whole Slide Image Analysis in Computational Pathology CVPR 2025

MVGenMaster: Scaling Multi-View Generation from Any Image via 3D Priors Enhanced Diffusion Model CVPR 2025

HSI-GPT: A General-Purpose Large Scene-Motion-Language Model for Human Scene Interaction CVPR 2025

Vid2Avatar-Pro: Authentic Avatar from Videos in the Wild via Universal Prior CVPR 2025