Multi-Modal Learning

1457 directly classified papers

Papers

Active Open-Vocabulary Recognition: Let Intelligent Moving Mitigate CLIP Limitations CVPR 2024

Text-Conditioned Generative Model of 3D Strand-based Human Hairstyles CVPR 2024

Mask Grounding for Referring Image Segmentation CVPR 2024

Frequency Spectrum Is More Effective for Multimodal Representation and Fusion: A Multimodal Spectrum Rumor Detector AAAI 2024

Non-autoregressive Sequence-to-Sequence Vision-Language Models CVPR 2024

A Hierarchical Network for Multimodal Document-Level Relation Extraction AAAI 2024

UFineBench: Towards Text-based Person Retrieval with Ultra-fine Granularity CVPR 2024

Generating Human Motion in 3D Scenes from Text Descriptions CVPR 2024

NTO3D: Neural Target Object 3D Reconstruction with Segment Anything CVPR 2024

SocialCounterfactuals: Probing and Mitigating Intersectional Social Biases in Vision-Language Models with Counterfactual Examples CVPR 2024

BOK-VQA: Bilingual Outside Knowledge-Based Visual Question Answering via Graph Representation Pretraining AAAI 2024

JeDi: Joint-Image Diffusion Models for Finetuning-Free Personalized Text-to-Image Generation CVPR 2024

Debiasing Multimodal Sarcasm Detection with Contrastive Learning AAAI 2024

Detours for Navigating Instructional Videos CVPR 2024

On the Robustness of Large Multimodal Models Against Image Adversarial Attacks CVPR 2024

Text2Loc: 3D Point Cloud Localization from Natural Language CVPR 2024

Cross-Modal and Uni-Modal Soft-Label Alignment for Image-Text Retrieval AAAI 2024

RILA: Reflective and Imaginative Language Agent for Zero-Shot Semantic Audio-Visual Navigation CVPR 2024

Detecting and Preventing Hallucinations in Large Vision Language Models AAAI 2024

Can I Trust Your Answer? Visually Grounded Video Question Answering CVPR 2024

Can't Make an Omelette Without Breaking Some Eggs: Plausible Action Anticipation Using Large Video-Language Models CVPR 2024

Instruct-ReID: A Multi-purpose Person Re-identification Task with Instructions CVPR 2024

Unified Language-driven Zero-shot Domain Adaptation CVPR 2024

MicroCinema: A Divide-and-Conquer Approach for Text-to-Video Generation CVPR 2024

MM-TTS: Multi-Modal Prompt Based Style Transfer for Expressive Text-to-Speech Synthesis AAAI 2024