Multi-Modal Learning

1457 directly classified papers

Papers

Diverse and Aligned Audio-to-Video Generation via Text-to-Video Model Adaptation AAAI 2024

RLHF-V: Towards Trustworthy MLLMs via Behavior Alignment from Fine-grained Correctional Human Feedback CVPR 2024

Multi-Modal Prompting for Open-Vocabulary Video Visual Relationship Detection AAAI 2024

Diff-BGM: A Diffusion Model for Video Background Music Generation CVPR 2024

From Audio to Photoreal Embodiment: Synthesizing Humans in Conversations CVPR 2024

Non-autoregressive Sequence-to-Sequence Vision-Language Models CVPR 2024

By My Eyes: Grounding Multimodal Large Language Models with Sensor Data via Visual Prompting EMNLP 2024

VidLA: Video-Language Alignment at Scale CVPR 2024

Weakly Supervised Multimodal Affordance Grounding for Egocentric Images AAAI 2024

CCEdit: Creative and Controllable Video Editing via Diffusion Models CVPR 2024

UFineBench: Towards Text-based Person Retrieval with Ultra-fine Granularity CVPR 2024

MuLTI: Efficient Video-and-Language Understanding with Text-Guided MultiWay-Sampler and Multiple Choice Modeling AAAI 2024

Universal Segmentation at Arbitrary Granularity with Language Instruction CVPR 2024

Language Models as Black-Box Optimizers for Vision-Language Models CVPR 2024

VideoGrounding-DINO: Towards Open-Vocabulary Spatio-Temporal Video Grounding CVPR 2024

Image as a Language: Revisiting Scene Text Recognition via Balanced, Unified and Synchronized Vision-Language Reasoning Network AAAI 2024

Modeling Collaborator: Enabling Subjective Vision Classification With Minimal Human Effort via LLM Tool-Use CVPR 2024

Text2Loc: 3D Point Cloud Localization from Natural Language CVPR 2024

Prompting Segmentation with Sound Is Generalizable Audio-Visual Source Localizer AAAI 2024

Decomposing Disease Descriptions for Enhanced Pathology Detection: A Multi-Aspect Vision-Language Pre-training Framework CVPR 2024

Physical Property Understanding from Language-Embedded Feature Fields CVPR 2024

Mitigating Object Hallucinations in Large Vision-Language Models through Visual Contrastive Decoding CVPR 2024

CL2CM: Improving Cross-Lingual Cross-Modal Retrieval via Cross-Lingual Knowledge Transfer AAAI 2024

Cycle-Consistency Learning for Captioning and Grounding AAAI 2024

Generating Human Motion in 3D Scenes from Text Descriptions CVPR 2024