Chen Sun

82 papers · 2013–2026 · 15 top CS/AI conferences

Achievements

🧭 Keyword Pioneer 🏃 Academic Marathon (13) 🌈 Renaissance Researcher (5) 🌉 Interdisciplinary Bridge 🌍 Conference Polyglot (15) 🌟 Keyword Trendsetter Combo (3) 🏠 Conference Loyalist (21) 🤝 Dynamic Duo (29) 🔬 Deep Specialist (15) 🧬 Topic Evolution 👑 Triple Crown ⚡ Prolific Year (9) 🗃️ Keyword Collector (320) ❓ The Questioner (10) 💎 Century Club (82) 📈 Trend Setter 🚀 Conference Pioneer 🔥 Unstoppable (14)

Conferences

CVPR (21) ICCV (14) ICLR (10) NIPS (10) ECCV (7) WACV (6) EMNLP (4) CONLL (2) ICML (2) ACL (1) CORL (1) IJCAI (1) INTERSPEECH (1) NAACL (1) NSDI (1)

Top co-authors

Cordelia Schmid (29) Anurag Arnab (10) Arsha Nagrani (8) Kevin Murphy (8) Shijie Wang (7) Ram Nevatia (6) Ellie Pavlick (6) Jiyang Gao (5) Rahul Sukthankar (5) Carl Vondrick (4)

Keywords

multimodal learning (8) video understanding (7) action recognition (6) self-supervised learning (5) zero-shot learning (5) representation learning (5) object detection (5) large language model (4) graph neural network (4) autonomous driving (3) transformer architecture (3) visual question answering (3) contrastive learning (3) convolutional neural network (3) attention mechanism (2) model compression (2) sequence modeling (2) video prediction (2) transfer learning (2) image retrieval (2)

Papers

Spacewalk-18: A Benchmark for Multimodal and Long-form Procedural Video Understanding in Novel Domains WACV 2026
You May Speak Freely: Improving the Fine-Grained Visual Recognition Capabilities of Multimodal Large Language Models with Answer Extraction WACV 2026
Learning Visual Grounding from Generative Vision and Language Model WACV 2025
Fourier Head: Helping Large Language Models Learn Complex Probability Distributions ICLR 2025
Dense Video Object Captioning from Disjoint Supervision ICLR 2025
Fluid: Scaling Autoregressive Text-to-image Generative Models with Continuous Tokens ICLR 2025
Solving New Tasks by Adapting Internet Video Knowledge ICLR 2025
How new data permeates LLM knowledge and how to dilute it ICLR 2025
How Can Objects Help Video-Language Understanding? ICCV 2025
What is an “Abstract Reasoner”? Revisiting Experiments and Arguments about Large Language Models ACL 2025
Motion Prompting: Controlling Video Generation with Motion Trajectories CVPR 2025
HyperFree: A Channel-adaptive and Tuning-free Foundation Model for Hyperspectral Remote Sensing Imagery CVPR 2025
What is an “Abstract Reasoner”? Revisiting Experiments and Arguments about Large Language Models CONLL 2025
MotiF: Making Text Count in Image Animation with Motion Focal Loss CVPR 2025
Potential Based Diffusion Motion Planning ICML 2024
Text-Aware Diffusion for Policy Learning NIPS 2024
Pixel-Aligned Language Model CVPR 2024
End-to-End Spatio-Temporal Action Localisation with Video Transformers CVPR 2024
Vamos: Versatile Action Models for Video Understanding ECCV 2024
EPO: Hierarchical LLM Agents with Environment Preference Optimization EMNLP 2024
AntGPT: Can Large Language Models Help Long-term Action Anticipation from Videos? ICLR 2024
Delta-AI: Local objectives for amortized inference in sparse graphical models ICLR 2024
Self-Correcting Self-Consuming Loops for Generative Model Training ICML 2024
Object-Centric Video Representation for Long-Term Action Anticipation WACV 2024
REVEAL: Retrieval-Augmented Visual-Language Pre-Training With Multi-Source Multimodal Knowledge Memory CVPR 2023
Analyzing Modular Approaches for Visual Question Decomposition EMNLP 2023
Emergence of Abstract State Representations in Embodied Sequence Modeling EMNLP 2023
Goal-Conditioned Predictive Coding for Offline Reinforcement Learning NIPS 2023
Contrastive Retrospection: honing in on critical steps for rapid learning and generalization in RL NIPS 2023
Deja Vu: Continual Model Generalization for Unseen Domains ICLR 2023
How Can Objects Help Action Recognition? CVPR 2023
AVIS: Autonomous Visual Information Seeking with Large Language Model Agent NIPS 2023
Does Visual Pretraining Help End-to-End Reasoning? NIPS 2023
Buffer-based End-to-end Request Event Monitoring in the Cloud NSDI 2022
Multiview Transformers for Video Recognition CVPR 2022
Trajectory balance: Improved credit assignment in GFlowNets NIPS 2022
Masking Modalities for Cross-Modal Video Retrieval WACV 2022
Learning Audio-Video Modalities from Image Captions ECCV 2022
TL;DW? Summarizing Instructional Videos with Task Relevance & Cross-Modal Saliency ECCV 2022
AVATAR: Unconstrained Audiovisual Speech Recognition INTERSPEECH 2022
Do Trajectories Encode Verb Meaning? NAACL 2022
Unified Graph Structured Models for Video Understanding ICCV 2021
ViViT: A Video Vision Transformer ICCV 2021
Composable Augmentation Encoding for Video Representation Learning ICCV 2021
DenseTNT: End-to-End Trajectory Prediction From Dense Goal Sets ICCV 2021
Episodic Transformer for Vision-and-Language Navigation ICCV 2021
Discrete-Valued Neural Communication NIPS 2021
HDMapGen: A Hierarchical Graph Generative Model of High Definition Maps CVPR 2021
Does Vision-and-Language Pretraining Improve Lexical Grounding? EMNLP 2021
Attention Bottlenecks for Multimodal Fusion NIPS 2021
Learning Temporal Dynamics From Cycles in Narrated Video ICCV 2021
Uncertainty-Aware Weakly Supervised Action Detection from Untrimmed Videos ECCV 2020
VectorNet: Encoding HD Maps and Agent Dynamics From Vectorized Representation CVPR 2020
Multi-modal Transformer for Video Retrieval ECCV 2020
Speech2Action: Cross-Modal Supervision for Action Recognition CVPR 2020
D3D: Distilled 3D Networks for Video Action Recognition WACV 2020
TNT: Target-driven Trajectory Prediction CORL 2020
What Makes for Good Views for Contrastive Learning? NIPS 2020
DNU: Deep Non-Local Unrolling for Computational Spectral Imaging CVPR 2020
Unsupervised learning of object structure and dynamics from videos NIPS 2019
Composing Text and Image for Image Retrieval - an Empirical Odyssey CVPR 2019
VideoBERT: A Joint Model for Video and Language Representation Learning ICCV 2019
Automated Pyramid Summarization Evaluation CONLL 2019
Relational Action Forecasting CVPR 2019
Stochastic Prediction of Multi-Agent Interactions from Partial Observations ICLR 2019
Hyperspectral Image Reconstruction Using a Deep Spatial-Spectral Prior CVPR 2019
Unsupervised Discovery of Parts, Structure, and Dynamics ICLR 2019
The INaturalist Species Classification and Detection Dataset CVPR 2018
Rethinking Spatiotemporal Feature Learning: Speed-Accuracy Trade-offs in Video Classification ECCV 2018
Large Scale Fine-Grained Categorization and Domain-Specific Transfer Learning CVPR 2018
AVA: A Video Dataset of Spatio-Temporally Localized Atomic Visual Actions CVPR 2018
Actor-centric Relation Network ECCV 2018
TURN TAP: Temporal Unit Regression Network for Temporal Action Proposals ICCV 2017
VQS: Linking Segmentations to Questions and Answers for Supervised Attention in VQA and Question-Focused Semantic Segmentation ICCV 2017
Revisiting Unreasonable Effectiveness of Data in Deep Learning Era ICCV 2017
Instance-Level Label Propagation with Multi-Instance Learning IJCAI 2017
Speed/Accuracy Trade-Offs for Modern Convolutional Object Detectors CVPR 2017
TALL: Temporal Activity Localization via Language Query ICCV 2017
ProNet: Learning to Propose Object-Specific Boxes for Cascaded Neural Networks CVPR 2016
Automatic Concept Discovery From Parallel Text and Visual Corpora ICCV 2015
DISCOVER: Discovering Important Segments for Classification of Video Events and Recounting CVPR 2014
ACTIVE: Activity Concept Transitions in Video Event Classification ICCV 2013