AI Safety

2972 directly classified papers

Papers

Defensive Unlearning with Adversarial Training for Robust Concept Erasure in Diffusion Models NIPS 2024

Catastrophic Goodhart: regularizing RLHF with KL divergence does not mitigate heavy-tailed reward misspecification NIPS 2024

Cooperation and Control in Delegation Games IJCAI 2024

BadFair: Backdoored Fairness Attacks with Group-conditioned Triggers EMNLP 2024

Zero-Resource Hallucination Prevention for Large Language Models EMNLP 2024

An Analysis of Tasks and Datasets in Peer Reviewing ACL 2024

Segmenting Watermarked Texts From Language Models NIPS 2024

NootNoot At SemEval-2024 Task 6: Hallucinations and Related Observable Overgeneration Mistakes Detection NAACL 2024

EAI: Emotional Decision-Making of LLMs in Strategic Games and Ethical Dilemmas NIPS 2024

Accelerating Greedy Coordinate Gradient and General Prompt Optimization via Probe Sampling NIPS 2024

Unelicitable Backdoors via Cryptographic Transformer Circuits NIPS 2024

The Art of Saying No: Contextual Noncompliance in Language Models NIPS 2024

WildTeaming at Scale: From In-the-Wild Jailbreaks to (Adversarially) Safer Language Models NIPS 2024

Dataset and Lessons Learned from the 2024 SaTML LLM Capture-the-Flag Competition NIPS 2024

MetaAligner: Towards Generalizable Multi-Objective Alignment of Language Models NIPS 2024

Protecting Your LLMs with Information Bottleneck NIPS 2024

Efficient LLM Jailbreak via Adaptive Dense-to-sparse Constrained Optimization NIPS 2024

Watermarking Makes Language Models Radioactive NIPS 2024

ProgressGym: Alignment with a Millennium of Moral Progress NIPS 2024

Soft Prompt Threats: Attacking Safety Alignment and Unlearning in Open-Source LLMs through the Embedding Space NIPS 2024

BackdoorAlign: Mitigating Fine-tuning based Jailbreak Attack with Backdoor Enhanced Safety Alignment NIPS 2024

LT-Defense: Searching-free Backdoor Defense via Exploiting the Long-tailed Effect NIPS 2024

ReMoDetect: Reward Models Recognize Aligned LLM's Generations NIPS 2024

Efficient Adversarial Training in LLMs with Continuous Attacks NIPS 2024

Self-contradictory reasoning evaluation and detection EMNLP 2024