Research Explorer
AI Safety
2972 directly classified papers
Papers per year
2002: 1
2006: 1
2007: 1
2012: 4
2013: 1
2015: 5
2016: 1
2017: 13
2018: 40
2019: 91
2020: 111
2021: 181
2022: 204
2023: 333
2024: 642
2025: 1031
2026: 312
Papers
Safe Exploration in Reinforcement Learning: A Generalized Formulation and Algorithms
NIPS 2023
BeaverTails: Towards Improved Safety Alignment of LLM via a Human-Preference Dataset
NIPS 2023
Safety Gymnasium: A Unified Safe Reinforcement Learning Benchmark
NIPS 2023
Constraint-Conditioned Policy Optimization for Versatile Safe Reinforcement Learning
NIPS 2023
Learning Shared Safety Constraints from Multi-task Demonstrations
NIPS 2023
Language Model Alignment with Elastic Reset
NIPS 2023
Honesty Is the Best Policy: Defining and Mitigating AI Deception
NIPS 2023
Risk-Averse Model Uncertainty for Distributionally Robust Safe Reinforcement Learning
NIPS 2023
Natural Actor-Critic for Robust Reinforcement Learning with Function Approximation
NIPS 2023
Red Teaming Deep Neural Networks with Feature Synthesis Tools
NIPS 2023
Provably Bounding Neural Network Preimages
NIPS 2023
Connecting Certified and Adversarial Training
NIPS 2023
Multi-scale Diffusion Denoised Smoothing
NIPS 2023
CamoPatch: An Evolutionary Strategy for Generating Camoflauged Adversarial Patches
NIPS 2023
FedGame: A Game-Theoretic Defense against Backdoor Attacks in Federated Learning
NIPS 2023
Enhancing Adversarial Robustness via Score-Based Optimization
NIPS 2023
Optimizing over trained GNNs via symmetry breaking
NIPS 2023
Towards Evaluating Transfer-based Attacks Systematically, Practically, and Fairly
NIPS 2023
Improving Robustness with Adaptive Weight Decay
NIPS 2023
Static and Sequential Malicious Attacks in the Context of Selective Forgetting
NIPS 2023
Robust Bayesian Satisficing
NIPS 2023
Black-box Backdoor Defense via Zero-shot Image Purification
NIPS 2023
Defending Pre-trained Language Models as Few-shot Learners against Backdoor Attacks
NIPS 2023
Certification of Distributional Individual Fairness
NIPS 2023
Attacks on Online Learners: a Teacher-Student Analysis
NIPS 2023