Multimodal Learning

13,057 papers

Papers

Controllable Human Image Generation with Personalized Multi-Garments CVPR 2025

FineLIP: Extending CLIP's Reach via Fine-Grained Alignment with Longer Text Inputs CVPR 2025

Divot: Diffusion Powers Video Tokenizer for Comprehension and Generation CVPR 2025

Video-Guided Foley Sound Generation with Multimodal Controls CVPR 2025

StyleStudio: Text-Driven Style Transfer with Selective Control of Style Elements CVPR 2025

CTRL-O: Language-Controllable Object-Centric Visual Representation Learning CVPR 2025

Text Augmented Correlation Transformer For Few-shot Classification & Segmentation CVPR 2025

ICT: Image-Object Cross-Level Trusted Intervention for Mitigating Object Hallucination in Large Vision-Language Models CVPR 2025

PreciseCam: Precise Camera Control for Text-to-Image Generation CVPR 2025

g3D-LF: Generalizable 3D-Language Feature Fields for Embodied Tasks CVPR 2025

Temporal Action Detection Model Compression by Progressive Block Drop CVPR 2025

DINOv2 Meets Text: A Unified Framework for Image- and Pixel-Level Vision-Language Alignment CVPR 2025

STEP: Enhancing Video-LLMs' Compositional Reasoning by Spatio-Temporal Graph-guided Self-Training CVPR 2025

OmniFlow: Any-to-Any Generation with Multi-Modal Rectified Flows CVPR 2025

PVC: Progressive Visual Token Compression for Unified Image and Video Processing in Large Vision-Language Models CVPR 2025

AutoPresent: Designing Structured Visuals from Scratch CVPR 2025

VisionArena: 230k Real World User-VLM Conversations with Preference Labels CVPR 2025

SemGeoMo: Dynamic Contextual Human Motion Generation with Semantic and Geometric Guidance CVPR 2025

FreeScene: Mixed Graph Diffusion for 3D Scene Synthesis from Free Prompts CVPR 2025

FactCheXcker: Mitigating Measurement Hallucinations in Chest X-ray Report Generation Models CVPR 2025

UniPose: A Unified Multimodal Framework for Human Pose Comprehension, Generation and Editing CVPR 2025

Reasoning to Attend: Try to Understand How <SEG> Token Works CVPR 2025

DynRefer: Delving into Region-level Multimodal Tasks via Dynamic Resolution CVPR 2025

JTD-UAV: MLLM-Enhanced Joint Tracking and Description Framework for Anti-UAV Systems CVPR 2025

Font-Agent: Enhancing Font Understanding with Large Language Models CVPR 2025