Multimodal Learning

13,057 papers

Papers

M^3-VOS: Multi-Phase, Multi-Transition, and Multi-Scenery Video Object Segmentation CVPR 2025

Everything to the Synthetic: Diffusion-driven Test-time Adaptation via Synthetic-Domain Alignment CVPR 2025

Img-Diff: Contrastive Data Synthesis for Multimodal Large Language Models CVPR 2025

Devil is in the Detail: Towards Injecting Fine Details of Image Prompt in Image Generation via Conflict-free Guidance and Stratified Attention CVPR 2025

SymDPO: Boosting In-Context Learning of Large Multimodal Models with Symbol Demonstration Direct Preference Optimization CVPR 2025

Stealthy Backdoor Attack in Self-Supervised Learning Vision Encoders for Large Vision Language Models CVPR 2025

Debiasing Multimodal Large Language Models via Noise-Aware Preference Optimization CVPR 2025

SAM2-LOVE: Segment Anything Model 2 in Language-aided Audio-Visual Scenes CVPR 2025

SynTab-LLaVA: Enhancing Multimodal Table Understanding with Decoupled Synthesis CVPR 2025

DrVideo: Document Retrieval Based Long Video Understanding CVPR 2025

PSHuman: Photorealistic Single-image 3D Human Reconstruction using Cross-Scale Multiview Diffusion and Explicit Remeshing CVPR 2025

DocLayLLM: An Efficient Multi-modal Extension of Large Language Models for Text-rich Document Understanding CVPR 2025

DoraCycle: Domain-Oriented Adaptation of Unified Generative Model in Multimodal Cycles CVPR 2025

IDOL: Instant Photorealistic 3D Human Creation from a Single Image CVPR 2025

PartRM: Modeling Part-Level Dynamics with Large Cross-State Reconstruction Model CVPR 2025

SeCap: Self-Calibrating and Adaptive Prompts for Cross-view Person Re-Identification in Aerial-Ground Networks CVPR 2025

SOLAMI: Social Vision-Language-Action Modeling for Immersive Interaction with 3D Autonomous Characters CVPR 2025

Adaptive Markup Language Generation for Contextually-Grounded Visual Document Understanding CVPR 2025

SALAD: Skeleton-aware Latent Diffusion for Text-driven Motion Generation and Editing CVPR 2025

DiN: Diffusion Model for Robust Medical VQA with Semantic Noisy Labels CVPR 2025

ShowMak3r: Compositional TV Show Reconstruction CVPR 2025

FSBench: A Figure Skating Benchmark for Advancing Artistic Sports Understanding CVPR 2025

VideoDirector: Precise Video Editing via Text-to-Video Models CVPR 2025

LLM-driven Multimodal and Multi-Identity Listening Head Generation CVPR 2025

VISTA: Enhancing Long-Duration and High-Resolution Video Understanding by Video Spatiotemporal Augmentation CVPR 2025