Research Explorer
AI Safety
2972 directly classified papers
Papers per year
2002: 1
2006: 1
2007: 1
2012: 4
2013: 1
2015: 5
2016: 1
2017: 13
2018: 40
2019: 91
2020: 111
2021: 181
2022: 204
2023: 333
2024: 642
2025: 1031
2026: 312
Papers
Defensive Unlearning with Adversarial Training for Robust Concept Erasure in Diffusion Models
NIPS 2024
Catastrophic Goodhart: regularizing RLHF with KL divergence does not mitigate heavy-tailed reward misspecification
NIPS 2024
Cooperation and Control in Delegation Games
IJCAI 2024
BadFair: Backdoored Fairness Attacks with Group-conditioned Triggers
EMNLP 2024
Zero-Resource Hallucination Prevention for Large Language Models
EMNLP 2024
An Analysis of Tasks and Datasets in Peer Reviewing
ACL 2024
Segmenting Watermarked Texts From Language Models
NIPS 2024
NootNoot At SemEval-2024 Task 6: Hallucinations and Related Observable Overgeneration Mistakes Detection
NAACL 2024
EAI: Emotional Decision-Making of LLMs in Strategic Games and Ethical Dilemmas
NIPS 2024
Accelerating Greedy Coordinate Gradient and General Prompt Optimization via Probe Sampling
NIPS 2024
Unelicitable Backdoors via Cryptographic Transformer Circuits
NIPS 2024
The Art of Saying No: Contextual Noncompliance in Language Models
NIPS 2024
WildTeaming at Scale: From In-the-Wild Jailbreaks to (Adversarially) Safer Language Models
NIPS 2024
Dataset and Lessons Learned from the 2024 SaTML LLM Capture-the-Flag Competition
NIPS 2024
MetaAligner: Towards Generalizable Multi-Objective Alignment of Language Models
NIPS 2024
Protecting Your LLMs with Information Bottleneck
NIPS 2024
Efficient LLM Jailbreak via Adaptive Dense-to-sparse Constrained Optimization
NIPS 2024
Watermarking Makes Language Models Radioactive
NIPS 2024
ProgressGym: Alignment with a Millennium of Moral Progress
NIPS 2024
Soft Prompt Threats: Attacking Safety Alignment and Unlearning in Open-Source LLMs through the Embedding Space
NIPS 2024
BackdoorAlign: Mitigating Fine-tuning based Jailbreak Attack with Backdoor Enhanced Safety Alignment
NIPS 2024
LT-Defense: Searching-free Backdoor Defense via Exploiting the Long-tailed Effect
NIPS 2024
ReMoDetect: Reward Models Recognize Aligned LLM's Generations
NIPS 2024
Efficient Adversarial Training in LLMs with Continuous Attacks
NIPS 2024
Self-contradictory reasoning evaluation and detection
EMNLP 2024