Multi-Modal Learning

1457 directly classified papers

Papers

Event Stream-based Visual Object Tracking: A High-Resolution Benchmark Dataset and A Novel Baseline CVPR 2024

GROUNDHOG: Grounding Large Language Models to Holistic Segmentation CVPR 2024

iKUN: Speak to Trackers without Retraining CVPR 2024

WebVLN: Vision-and-Language Navigation on Websites AAAI 2024

Separating the "Chirp" from the "Chat": Self-supervised Visual Grounding of Sound and Language CVPR 2024

Contrasting Intra-Modal and Ranking Cross-Modal Hard Negatives to Enhance Visio-Linguistic Compositional Understanding CVPR 2024

Weakly-Supervised Audio-Visual Video Parsing with Prototype-based Pseudo-Labeling CVPR 2024

DAMRO: Dive into the Attention Mechanism of LVLM to Reduce Object Hallucination EMNLP 2024

LayoutFormer: Hierarchical Text Detection Towards Scene Text Understanding CVPR 2024

VIXEN: Visual Text Comparison Network for Image Difference Captioning AAAI 2024

UniHuman: A Unified Model For Editing Human Images in the Wild CVPR 2024

Analyzing Key Factors Influencing Emotion Prediction Performance of VLLMs in Conversational Contexts EMNLP 2024

Image Safeguarding: Reasoning with Conditional Vision Language Model and Obfuscating Unsafe Content Counterfactually AAAI 2024

SEER: Backdoor Detection for Vision-Language Models through Searching Target Text and Image Trigger Jointly AAAI 2024

Bring Event into RGB and LiDAR: Hierarchical Visual-Motion Fusion for Scene Flow CVPR 2024

GAMA: A Large Audio-Language Model with Advanced Audio Understanding and Complex Reasoning Abilities EMNLP 2024

Discovering Syntactic Interaction Clues for Human-Object Interaction Detection CVPR 2024

Embodied Multi-Modal Agent trained by an LLM from a Parallel TextWorld CVPR 2024

Inter-X: Towards Versatile Human-Human Interaction Analysis CVPR 2024

Revisiting motion information for RGB-Event tracking with MOT philosophy NeurIPS 2024

Concept-skill Transferability-based Data Selection for Large Vision-Language Models EMNLP 2024

Text-Conditioned Generative Model of 3D Strand-based Human Hairstyles CVPR 2024

JRDB-Social: A Multifaceted Robotic Dataset for Understanding of Context and Dynamics of Human Interactions Within Social Groups CVPR 2024

Beyond Embeddings: The Promise of Visual Table in Visual Reasoning EMNLP 2024

Bridging Modalities: Enhancing Cross-Modality Hate Speech Detection with Few-Shot In-Context Learning EMNLP 2024