Difei Gao
23 papers · 2020–2025 · 9 conferences · across top CS/AI conferences
Achievements
Cross-Pollinator (14)
Keyword Pioneer
Academic Marathon (5)
Conference Polyglot (9)
Renaissance Researcher (7)
Interdisciplinary Bridge
Taxonomy Completionist (42)
Deep Specialist (11)
Dynamic Duo (20)
Topic Evolution
Prolific Year (7)
Keyword Collector (99)
Unstoppable (6)
Century Club (23)
Conferences
CVPR (7)
ICCV (4)
ECCV (3)
NIPS (3)
EMNLP (2)
AAAI (1)
ACL (1)
ICLR (1)
IJCAI (1)
Keywords
multimodal learning (7)
multi-modal learning (4)
video understanding (4)
video question answering (4)
visual question answering (3)
graphical user interface (3)
action recognition (2)
vision transformer (2)
continual learning (2)
benchmark evaluation (2)
large language model (2)
video temporal grounding (2)
egocentric vision (2)
contrastive learning (2)
instructional video (2)
knowledge transfer (1)
attention mechanism (1)
video captioning (1)
curriculum learning (1)
scene understanding (1)
Papers
Grounding Multimodal Large Language Model in GUI World
ICLR 2025
Factorized Learning for Temporally Grounded Video-Language Models
ICCV 2025
ShowUI: One Vision-Language-Action Model for GUI Visual Agent
CVPR 2025
Delocate: Detection and Localization for Deepfake Videos with Randomly-Located Tampered Traces
IJCAI 2024
VideoGUI: A Benchmark for GUI Automation from Instructional Videos
NIPS 2024
LOVA3: Learning to Visual Question Answering, Asking and Assessment
NIPS 2024
VideoLLM-online: Online Video Large Language Model for Streaming Video
CVPR 2024
ViT-Lens: Towards Omni-modal Representations
CVPR 2024
AssistGUI: Task-Oriented PC Graphical User Interface Automation
CVPR 2024
Learning Video Context as Interleaved Multimodal Sequences
ECCV 2024
CONE: An Efficient COarse-to-fiNE Alignment Framework for Long Video Temporal Grounding
ACL 2023
GazeVQA: A Video Question Answering Dataset for Multiview Eye-Gaze Task-Oriented Collaborations
EMNLP 2023
Learning to Learn: How to Continuously Teach Humans and Machines
ICCV 2023
Affordance Grounding From Demonstration Video To Target Image
CVPR 2023
MIST: Multi-Modal Iterative Spatial-Temporal Transformer for Long-Form Video Question Answering
CVPR 2023
UniVTG: Towards Unified Video-Language Temporal Grounding
ICCV 2023
Symbolic Replay: Scene Graph as Prompt for Continual Learning on VQA Task
AAAI 2023
GEB+: A Benchmark for Generic Event Boundary Captioning, Grounding and Retrieval
ECCV 2022
AssistQ: Affordance-Centric Question-Driven Task Completion for Egocentric Assistant
ECCV 2022
Egocentric Video-Language Pretraining
NIPS 2022
AssistSR: Task-oriented Video Segment Retrieval for Personal AI Assistant
EMNLP 2022
Env-QA: A Video Question Answering Benchmark for Comprehensive Understanding of Dynamic Environments
ICCV 2021
Multi-Modal Graph Neural Network for Joint Reasoning on Vision and Scene Text
CVPR 2020