Research Explorer
AI Safety
2972 directly classified papers
Papers per year
2002: 1
2006: 1
2007: 1
2012: 4
2013: 1
2015: 5
2016: 1
2017: 13
2018: 40
2019: 91
2020: 111
2021: 181
2022: 204
2023: 333
2024: 642
2025: 1031
2026: 312
Papers
Neural Policy Safety Verification via Predicate Abstraction: CEGAR
AAAI 2023
Understanding and Enhancing Robustness of Concept-Based Models
AAAI 2023
WAT: Improve the Worst-Class Robustness in Adversarial Training
AAAI 2023
Quantization-Aware Interval Bound Propagation for Training Certifiably Robust Quantized Neural Networks
AAAI 2023
A Semidefinite Relaxation Based Branch-and-Bound Method for Tight Neural Network Verification
AAAI 2023
Out-of-Distribution Detection Is Not All You Need
AAAI 2023
PatchNAS: Repairing DNNs in Deployment with Patched Network Architecture Search
AAAI 2023
Jailbroken: How Does LLM Safety Training Fail?
NIPS 2023
Language Models Don't Always Say What They Think: Unfaithful Explanations in Chain-of-Thought Prompting
NIPS 2023
Setting the Trap: Capturing and Defeating Backdoors in Pretrained Language Models through Honeypots
NIPS 2023
TrojLLM: A Black-box Trojan Prompt Attack on Large Language Models
NIPS 2023
On the Exploitability of Instruction Tuning
NIPS 2023
CARE-MI: Chinese Benchmark for Misinformation Evaluation in Maternity and Infant Care
NIPS 2023
Collaborative Alignment of NLP Models
NIPS 2023
DecodingTrust: A Comprehensive Assessment of Trustworthiness in GPT Models
NIPS 2023
Paraphrasing evades detectors of AI-generated text, but retrieval is an effective defense
NIPS 2023
A Pseudo-Semantic Loss for Autoregressive Models with Logical Constraints
NIPS 2023
Iterative Reachability Estimation for Safe Reinforcement Learning
NIPS 2023
Seeing is not Believing: Robust Reinforcement Learning against Spurious Correlation
NIPS 2023
Survival Instinct in Offline Reinforcement Learning
NIPS 2023
Provably Safe Reinforcement Learning with Step-wise Violation Constraints
NIPS 2023
Sample-Efficient and Safe Deep Reinforcement Learning via Reset Deep Ensemble Agents
NIPS 2023
Behavior Alignment via Reward Function Optimization
NIPS 2023
BIRD: Generalizable Backdoor Detection and Removal for Deep Reinforcement Learning
NIPS 2023
Corruption-Robust Offline Reinforcement Learning with General Function Approximation
NIPS 2023