Transformers

9294 directly classified papers

Papers

Learning Occlusion-Robust Vision Transformers for Real-Time UAV Tracking CVPR 2025

Vision-Language Embodiment for Monocular Depth Estimation CVPR 2025

TokenFlow: Unified Image Tokenizer for Multimodal Understanding and Generation CVPR 2025

ArtFormer: Controllable Generation of Diverse 3D Articulated Objects CVPR 2025

MAP: Unleashing Hybrid Mamba-Transformer Vision Backbone's Potential with Masked Autoregressive Pretraining CVPR 2025

ABC-Former: Auxiliary Bimodal Cross-domain Transformer with Interactive Channel Attention for White Balance CVPR 2025

Accelerating Diffusion Transformer via Increment-Calibrated Caching with Channel-Aware Singular Value Decomposition CVPR 2025

Open Ad-hoc Categorization with Contextualized Feature Learning CVPR 2025

TSP-Mamba: The Travelling Salesman Problem Meets Mamba for Image Super-resolution and Beyond CVPR 2025

BOOTPLACE: Bootstrapped Object Placement with Detection Transformers CVPR 2025

Reconstruction vs. Generation: Taming Optimization Dilemma in Latent Diffusion Models CVPR 2025

Mixture of Weight-shared Heterogeneous Group Attention Experts for Dynamic Token-wise KV Optimization EMNLP 2025

Cluster Based Heterogeneous Federated Foundation Model Adaptation and Fine-Tuning AAAI 2025

Sequence Accumulation and Beyond: Infinite Context Length on Single GPU and Large Clusters AAAI 2025

Detecting Legal Citations in United Kingdom Court Judgments EMNLP 2025

Conan-Embedding-v2: Training an LLM from Scratch for Text Embeddings EMNLP 2025

VRoPE: Rotary Position Embedding for Video Large Language Models EMNLP 2025

WST: Wavelet-Based Multi-scale Tuning for Visual Transfer Learning AAAI 2025

Boosting Multi-modal Keyphrase Prediction with Dynamic Chain-of-Thought in Vision-Language Models EMNLP 2025

UniMuMo: Unified Text, Music, and Motion Generation AAAI 2025

SeedVR: Seeding Infinity in Diffusion Transformer Towards Generic Video Restoration CVPR 2025

Can We Get Rid of Handcrafted Feature Extractors? SparseViT: Nonsemantics-Centered, Parameter-Efficient Image Manipulation Localization Through Spare-Coding Transformer AAAI 2025

mmFAS: Multimodal Face Anti-Spoofing Using Multi-Level Alignment and Switch-Attention Fusion AAAI 2025

IMAGDressing-v1: Customizable Virtual Dressing AAAI 2025

A Generative Pre-Trained Language Model for Channel Prediction in Wireless Communications Systems EMNLP 2025