model security

118 papers

Co-occurring keywords

backdoor attack (377), adversarial attack (1599), adversarial learning (1592), large language model (12755), adversarial robustness (1335), adversarial defense (324), jailbreak attack (198), neural network (6616), trojan attack (23), data poisoning (128)

Papers

ELBA-Bench: An Efficient Learning Backdoor Attacks Benchmark for Large Language Models (ACL 2025)

SABER: Uncovering Vulnerabilities in Safety Alignment via Cross-Layer Residual Connection (EMNLP 2025)

Retracing the Past: LLMs Emit Training Data When They Get Lost (EMNLP 2025)

from Benign import Toxic: Jailbreaking the Language Model via Adversarial Metaphors (ACL 2025)

SPIRIT: Patching Speech Language Models against Jailbreak Attacks (EMNLP 2025)

SEAS: Self-Evolving Adversarial Safety Optimization for Large Language Models (AAAI 2025)

Shadow-Activated Backdoor Attacks on Multimodal Large Language Models (ACL 2025)

Watch Out for Your Guidance on Generation! Exploring Conditional Backdoor Attacks against Large Language Models (AAAI 2025)

Infighting in the Dark: Multi-Label Backdoor Attack in Federated Learning (CVPR 2025)

RepeatLeakage: Leak Prompts from Repeating as Large Language Model Is a Good Repeater (AAAI 2025)

Stealthy Jailbreak Attacks on Large Language Models via Benign Data Mirroring (NAACL 2025)

Backdoor Attacks on Neural Networks via One-Bit Flip (ICCV 2025)

Seal Your Backdoor with Variational Defense (ICCV 2025)

Medical MLLM Is Vulnerable: Cross-Modality Jailbreak and Mismatched Attacks on Medical Multimodal Large Language Models (AAAI 2025)

SPD: Shallow Backdoor Protecting Deep Backdoor Against Backdoor Detection (ICCV 2025)

Safety in Large Reasoning Models: A Survey (EMNLP 2025)

Obliviate: Neutralizing Task-agnostic Backdoors within the Parameter-efficient Fine-tuning Paradigm (NAACL 2025)

Guardrails and Security for LLMs: Safe, Secure and Controllable Steering of LLM Applications (ACL 2025)

Towards Understanding the Fragility of Multilingual LLMs against Fine-Tuning Attacks (NAACL 2025)

PANORAMIA: Privacy Auditing of Machine Learning Models without Retraining (NIPS 2024)

BadCLIP: Dual-Embedding Guided Backdoor Attack on Multimodal Contrastive Learning (CVPR 2024)

Attack To Defend: Exploiting Adversarial Attacks for Detecting Poisoned Models (CVPR 2024)

Unelicitable Backdoors via Cryptographic Transformer Circuits (NIPS 2024)

Taylor Unswift: Secured Weight Release for Large Language Models via Taylor Expansion (EMNLP 2024)

Injecting Undetectable Backdoors in Obfuscated Neural Networks and Language Models (NIPS 2024)