Difei Gao

23 papers · 2020–2025 · 9 top CS/AI conferences

Achievements

🐝 Cross-Pollinator (14) 🧭 Keyword Pioneer 🏃 Academic Marathon (5) 🌍 Conference Polyglot (9) 🌈 Renaissance Researcher (7) 🌉 Interdisciplinary Bridge 🗺️ Taxonomy Completionist (42) 🔬 Deep Specialist (11) 🤝 Dynamic Duo (20) 🧬 Topic Evolution ⚡ Prolific Year (7) 🗃️ Keyword Collector (99) 🔥 Unstoppable (6) 💎 Century Club (23)

Conferences

CVPR (7) ICCV (4) ECCV (3) NIPS (3) EMNLP (2) AAAI (1) ACL (1) ICLR (1) IJCAI (1)

Top co-authors

Mike Zheng Shou (20) Kevin Qinghong Lin (7) Joya Chen (6) Weixian Lei (5) Stan Weixian Lei (4) Dongxing Mao (4) Yuxuan Wang (3) Zechen Bai (3) Lei Ji (3) Ziteng Gao (2)

Keywords

multimodal learning (7) multi-modal learning (4) video understanding (4) video question answering (4) visual question answering (3) graphical user interface (3) action recognition (2) vision transformer (2) continual learning (2) benchmark evaluation (2) large language model (2) video temporal grounding (2) egocentric vision (2) contrastive learning (2) instructional video (2) knowledge transfer (1) attention mechanism (1) video captioning (1) curriculum learning (1) scene understanding (1)

Papers

Grounding Multimodal Large Language Model in GUI World · ICLR 2025
Factorized Learning for Temporally Grounded Video-Language Models · ICCV 2025
ShowUI: One Vision-Language-Action Model for GUI Visual Agent · CVPR 2025
Delocate: Detection and Localization for Deepfake Videos with Randomly-Located Tampered Traces · IJCAI 2024
VideoGUI: A Benchmark for GUI Automation from Instructional Videos · NIPS 2024
LOVA3: Learning to Visual Question Answering, Asking and Assessment · NIPS 2024
VideoLLM-online: Online Video Large Language Model for Streaming Video · CVPR 2024
ViT-Lens: Towards Omni-modal Representations · CVPR 2024
AssistGUI: Task-Oriented PC Graphical User Interface Automation · CVPR 2024
Learning Video Context as Interleaved Multimodal Sequences · ECCV 2024
CONE: An Efficient COarse-to-fiNE Alignment Framework for Long Video Temporal Grounding · ACL 2023
GazeVQA: A Video Question Answering Dataset for Multiview Eye-Gaze Task-Oriented Collaborations · EMNLP 2023
Learning to Learn: How to Continuously Teach Humans and Machines · ICCV 2023
Affordance Grounding From Demonstration Video To Target Image · CVPR 2023
MIST: Multi-Modal Iterative Spatial-Temporal Transformer for Long-Form Video Question Answering · CVPR 2023
UniVTG: Towards Unified Video-Language Temporal Grounding · ICCV 2023
Symbolic Replay: Scene Graph as Prompt for Continual Learning on VQA Task · AAAI 2023
GEB+: A Benchmark for Generic Event Boundary Captioning, Grounding and Retrieval · ECCV 2022
AssistQ: Affordance-Centric Question-Driven Task Completion for Egocentric Assistant · ECCV 2022
Egocentric Video-Language Pretraining · NIPS 2022
AssistSR: Task-oriented Video Segment Retrieval for Personal AI Assistant · EMNLP 2022
Env-QA: A Video Question Answering Benchmark for Comprehensive Understanding of Dynamic Environments · ICCV 2021
Multi-Modal Graph Neural Network for Joint Reasoning on Vision and Scene Text · CVPR 2020