Multimodal Learning

13,057 papers

Papers

TG-LLaVA: Text Guided LLaVA via Learnable Latent Embeddings AAAI 2025

PBECount: Prompt-Before-Extract Paradigm for Class-Agnostic Counting AAAI 2025

PlanLLM: Video Procedure Planning with Refinable Large Language Models AAAI 2025

CLIP-MSM: A Multi-Semantic Mapping Brain Representation for Human High-Level Visual Cortex AAAI 2025

LiDAR-LLM: Exploring the Potential of Large Language Models for 3D LiDAR Understanding AAAI 2025

RealPortrait: Realistic Portrait Animation with Diffusion Transformers AAAI 2025

Concept Conductor: Orchestrating Multiple Personalized Concepts in Text-to-Image Synthesis AAAI 2025

Towards Open-Vocabulary Remote Sensing Image Semantic Segmentation AAAI 2025

Personalized Lip Reading: Adapting to Your Unique Lip Movements with Vision and Language AAAI 2025

Unified Coding for Both Human Perception and Generalized Machine Analytics with CLIP Supervision AAAI 2025

Action-Agnostic Point-Level Supervision for Temporal Action Detection AAAI 2025

ReMoGPT: Part-Level Retrieval-Augmented Motion-Language Models AAAI 2025

Fine-grained Adaptive Visual Prompt for Generative Medical Visual Question Answering AAAI 2025

Cross-Lingual Text-Rich Visual Comprehension: An Information Theory Perspective AAAI 2025

Building a Multi-modal Spatiotemporal Expert for Zero-shot Action Recognition with CLIP AAAI 2025

Instruction-guided Multi-Granularity Segmentation and Captioning with Large Multimodal Model AAAI 2025

DVP-MVS: Synergize Depth-Edge and Visibility Prior for Multi-View Stereo AAAI 2025

Interpretable Face Anti-Spoofing: Enhancing Generalization with Multimodal Large Language Models AAAI 2025

DocKylin: A Large Multimodal Model for Visual Document Understanding with Efficient Visual Slimming AAAI 2025

Just a Few Glances: Open-Set Visual Perception with Image Prompt Paradigm AAAI 2025

Visual Perturbation for Text-Based Person Search AAAI 2025

Matching While Perceiving: Enhance Image Feature Matching with Applicable Semantic Amalgamation AAAI 2025

SVTformer: Spatial-View-Temporal Transformer for Multi-View 3D Human Pose Estimation AAAI 2025

Enhancing Multimodal Large Language Models Complex Reason via Similarity Computation AAAI 2025

Track the Answer: Extending TextVQA from Image to Video with Spatio-Temporal Clues AAAI 2025