Multimodal Learning

13,057 papers

Papers

Customized Condition Controllable Generation for Video Soundtrack CVPR 2025

ViCaS: A Dataset for Combining Holistic and Pixel-level Video Understanding using Captions with Grounded Segmentation CVPR 2025

Pixel-aligned RGB-NIR Stereo Imaging and Dataset for Robot Vision CVPR 2025

Can Machines Understand Composition? Dataset and Benchmark for Photographic Image Composition Embedding and Understanding CVPR 2025

Docopilot: Improving Multimodal Models for Document-Level Understanding CVPR 2025

Advancing Myopia To Holism: Fully Contrastive Language-Image Pre-training CVPR 2025

Which Viewpoint Shows it Best? Language for Weakly Supervising View Selection in Multi-view Instructional Videos CVPR 2025

Understanding Multi-Task Activities from Single-Task Videos CVPR 2025

Adaptive Keyframe Sampling for Long Video Understanding CVPR 2025

What's in the Image? A Deep-Dive into the Vision of Vision Language Models CVPR 2025

UPME: An Unsupervised Peer Review Framework for Multimodal Large Language Model Evaluation CVPR 2025

MBQ: Modality-Balanced Quantization for Large Vision-Language Models CVPR 2025

Florence-VL: Enhancing Vision-Language Models with Generative Vision Encoder and Depth-Breadth Fusion CVPR 2025

Seq2Time: Sequential Knowledge Transfer for Video LLM Temporal Grounding CVPR 2025

Enhancing Virtual Try-On with Synthetic Pairs and Error-Aware Noise Scheduling CVPR 2025

MP-GUI: Modality Perception with MLLMs for GUI Understanding CVPR 2025

ChatGarment: Garment Estimation, Generation and Editing via Large Language Models CVPR 2025

GRAPHGPT-O: Synergistic Multimodal Comprehension and Generation on Graphs CVPR 2025

Optimus-2: Multimodal Minecraft Agent with Goal-Observation-Action Conditioned Policy CVPR 2025

PARC: A Quantitative Framework Uncovering the Symmetries within Vision Language Models CVPR 2025

Redefining <Creative> in Dictionary: Towards an Enhanced Semantic Understanding of Creative Generation CVPR 2025

VLog: Video-Language Models by Generative Retrieval of Narration Vocabulary CVPR 2025

Forensics-Bench: A Comprehensive Forgery Detection Benchmark Suite for Large Vision Language Models CVPR 2025

Empowering Large Language Models with 3D Situation Awareness CVPR 2025

EchoTraffic: Enhancing Traffic Anomaly Understanding with Audio-Visual Insights CVPR 2025