Model Compression

1674 directly classified papers

Papers

Cherry on Top: Parameter Heterogeneity and Quantization in Large Language Models NIPS 2024

LayerSkip: Enabling Early Exit Inference and Self-Speculative Decoding ACL 2024

LLM in a flash: Efficient Large Language Model Inference with Limited Memory ACL 2024

UniPTS: A Unified Framework for Proficient Post-Training Sparsity CVPR 2024

KVQuant: Towards 10 Million Context Length LLM Inference with KV Cache Quantization NIPS 2024

ChunkAttention: Efficient Self-Attention with Prefix-Aware KV Cache and Two-Phase Partition ACL 2024

On the Impact of Calibration Data in Post-training Quantization and Pruning ACL 2024

USDN: A Unified Sample-Wise Dynamic Network With Mixed-Precision and Early-Exit WACV 2024

PV-Tuning: Beyond Straight-Through Estimation for Extreme LLM Compression NIPS 2024

BOLD: Boolean Logic Deep Learning NIPS 2024

Dodo: Dynamic Contextual Compression for Decoder-only LMs ACL 2024

Acquiring Clean Language Models from Backdoor Poisoned Datasets by Downscaling Frequency Space ACL 2024

Learning To Compose SuperWeights for Neural Parameter Allocation Search WACV 2024

NACL: A General and Effective KV Cache Eviction Framework for LLM at Inference Time ACL 2024

SwapMoE: Serving Off-the-shelf MoE-based Large Language Models with Tunable Memory Budget ACL 2024

PartialFormer: Modeling Part Instead of Whole for Machine Translation ACL 2024

Attention-Driven Training-Free Efficiency Enhancement of Diffusion Models CVPR 2024

WRP: Weight Recover Prune for Structured Sparsity ACL 2024

Not All Experts are Equal: Efficient Expert Pruning and Skipping for Mixture-of-Experts Large Language Models ACL 2024

SparseFlow: Accelerating Transformers by Sparsifying Information Flows ACL 2024

MELoRA: Mini-Ensemble Low-Rank Adapters for Parameter-Efficient Fine-Tuning ACL 2024

MAP: MAsk-Pruning for Source-Free Model Intellectual Property Protection CVPR 2024

FM-Delta: Lossless Compression for Storing Massive Fine-tuned Foundation Models NIPS 2024

QDyLoRA: Quantized Dynamic Low-Rank Adaptation for Efficient Large Language Model Tuning EMNLP 2024

Parameter Competition Balancing for Model Merging NIPS 2024