Multimodal Learning

13,057 papers

Papers

HGSFusion: Radar-Camera Fusion with Hybrid Generation and Synchronization for 3D Object Detection AAAI 2025

VTG-LLM: Integrating Timestamp Knowledge into Video LLMs for Enhanced Video Temporal Grounding AAAI 2025

GGS: Generalizable Gaussian Splatting for Lane Switching in Autonomous Driving AAAI 2025

Exploiting Multimodal Spatial-temporal Patterns for Video Object Tracking AAAI 2025

V2Xum-LLM: Cross-Modal Video Summarization with Temporal Prompt Instruction Tuning AAAI 2025

Identity-Text Video Corpus Grounding AAAI 2025

EvoChart: A Benchmark and a Self-Training Approach Towards Real-World Chart Understanding AAAI 2025

VProChart: Answering Chart Question Through Visual Perception Alignment Agent and Programmatic Solution Reasoning AAAI 2025

Unleashing the Temporal-Spatial Reasoning Capacity of GPT for Training-Free Audio and Language Referenced Video Object Segmentation AAAI 2025

DAMPER: A Dual-Stage Medical Report Generation Framework with Coarse-Grained MeSH Alignment and Fine-Grained Hypergraph Matching AAAI 2025

Towards a Multimodal Large Language Model with Pixel-Level Insight for Biomedicine AAAI 2025

SeFAR: Semi-supervised Fine-grained Action Recognition with Temporal Perturbation and Learning Stabilization AAAI 2025

VG-TVP: Multimodal Procedural Planning via Visually Grounded Text-Video Prompting AAAI 2025

QORT-Former: Query-optimized Real-time Transformer for Understanding Two Hands Manipulating Objects AAAI 2025

FastLGS: Speeding Up Language Embedded Gaussians with Feature Grid Mapping AAAI 2025

Orchestrating the Symphony of Prompt Distribution Learning for Human-Object Interaction Detection AAAI 2025

Granularity-Adaptive Spatial Evidence Tokenization for Video Question Answering AAAI 2025

LATTE: Improving Latex Recognition for Tables and Formulae with Iterative Refinement AAAI 2025

What Kind of Visual Tokens Do We Need? Training-Free Visual Token Pruning for Multi-Modal Large Language Models from the Perspective of Graph AAAI 2025

Pedestrian Attribute Recognition: A New Benchmark Dataset and a Large Language Model Augmented Framework AAAI 2025

Bridging the Semantic Granularity Gap Between Text and Frame Representations for Partially Relevant Video Retrieval AAAI 2025

DEEPTalk: Dynamic Emotion Embedding for Probabilistic Speech-Driven 3D Face Animation AAAI 2025

MoDiTalker: Motion-Disentangled Diffusion Model for High-Fidelity Talking Head Generation AAAI 2025

ViPCap: Retrieval Text-Based Visual Prompts for Lightweight Image Captioning AAAI 2025

Multi-Modal Grounded Planning and Efficient Replanning for Learning Embodied Agents with a Few Examples AAAI 2025