Multimodal Learning

13,057 papers

Papers

M3Net: Multimodal Multi-task Learning for 3D Detection, Segmentation, and Occupancy Prediction in Autonomous Driving (AAAI 2025)

Recoverable Compression: A Multimodal Vision Token Recovery Mechanism Guided by Text Information (AAAI 2025)

Comprehensive Multi-Modal Prototypes Are Simple and Effective Classifiers for Vast-Vocabulary Object Detection (AAAI 2025)

SVGBuilder: Component-Based Colored SVG Generation with Text-Guided Autoregressive Transformers (AAAI 2025)

VersaGen: Unleashing Versatile Visual Control for Text-to-Image Synthesis (AAAI 2025)

EchoMimic: Lifelike Audio-Driven Portrait Animations through Editable Landmark Conditions (AAAI 2025)

DIDiffGes: Decoupled Semi-Implicit Diffusion Models for Real-time Gesture Generation from Speech (AAAI 2025)

Graphic Design with Large Multimodal Model (AAAI 2025)

Ambiguity-Restrained Text-Video Representation Learning for Partially Relevant Video Retrieval (AAAI 2025)

Zero-Shot Scene Change Detection (AAAI 2025)

Elevating Flow-Guided Video Inpainting with Reference Generation (AAAI 2025)

DynASyn: Multi-Subject Personalization Enabling Dynamic Action Synthesis (AAAI 2025)

MASS: Overcoming Language Bias in Image-Text Matching (AAAI 2025)

LOMA: Language-assisted Semantic Occupancy Network via Triplane Mamba (AAAI 2025)

Harmonious Music-driven Group Choreography with Trajectory-Controllable Diffusion (AAAI 2025)

Boundary-Aware Temporal Dynamic Pseudo-Supervision Pairs Generation for Zero-Shot Natural Language Video Localization (AAAI 2025)

Occlusion-Insensitive Talking Head Video Generation via Facelet Compensation (AAAI 2025)

InstructOCR: Instruction Boosting Scene Text Spotting (AAAI 2025)

Multi-Pair Temporal Sentence Grounding via Multi-Thread Knowledge Transfer Network (AAAI 2025)

HDLayout: Hierarchical and Directional Layout Planning for Arbitrary Shaped Visual Text Generation (AAAI 2025)

Foundation Model Driven Appearance Extraction for Robust Multiple Object Tracking (AAAI 2025)

AIM: Let Any Multimodal Large Language Models Embrace Efficient In-Context Learning (AAAI 2025)

TC-LLaVA: Rethinking the Transfer of LLava from Image to Video Understanding with Temporal Considerations (AAAI 2025)

Queryable Prototype Multiple Instance Learning with Vision-Language Models for Incremental Whole Slide Image Classification (AAAI 2025)

MaintaAvatar: A Maintainable Avatar Based on Neural Radiance Fields by Continual Learning (AAAI 2025)