Research Explorer
AI Safety
2972 directly classified papers
Papers per year
2002: 1
2006: 1
2007: 1
2012: 4
2013: 1
2015: 5
2016: 1
2017: 13
2018: 40
2019: 91
2020: 111
2021: 181
2022: 204
2023: 333
2024: 642
2025: 1031
2026: 312
Papers
Optimal Zero-Shot Detector for Multi-Armed Attacks
AISTATS 2024
Large Language Models Must Be Taught to Know What They Don’t Know
NIPS 2024
MoGU: A Framework for Enhancing Safety of LLMs While Preserving Their Usability
NIPS 2024
Uncovering, Explaining, and Mitigating the Superficial Safety of Backdoor Defense
NIPS 2024
POCE: Primal Policy Optimization with Conservative Estimation for Multi-constraint Offline Reinforcement Learning
CVPR 2024
Safe-Embed: Unveiling the Safety-Critical Knowledge of Sentence Encoders
ACL 2024
POLICEd RL: Learning Closed-Loop Robot Control Policies with Provable Satisfaction of Hard Constraints
RSS 2024
DA3: A Distribution-Aware Adversarial Attack against Language Models
EMNLP 2024
Optimistic Verifiable Training by Controlling Hardware Nondeterminism
NIPS 2024
Real-Time Anomaly Detection and Reactive Planning with Large Language Models
RSS 2024
Axioms for AI Alignment from Human Feedback
NIPS 2024
Merging AI Incidents Research with Political Misinformation Research: Introducing the Political Deepfakes Incidents Database
AAAI 2024
ProMark: Proactive Diffusion Watermarking for Causal Attribution
CVPR 2024
MAP: MAsk-Pruning for Source-Free Model Intellectual Property Protection
CVPR 2024
Steganographic Passport: An Owner and User Verifiable Credential for Deep Model IP Protection Without Retraining
CVPR 2024
BadCLIP: Dual-Embedding Guided Backdoor Attack on Multimodal Contrastive Learning
CVPR 2024
Defending Against Alignment-Breaking Attacks via Robustly Aligned LLM
ACL 2024
The Dawn After the Dark: An Empirical Study on Factuality Hallucination in Large Language Models
ACL 2024
PRP: Propagating Universal Perturbations to Attack Large Language Model Guard-Rails
ACL 2024
SoFA: Shielded On-the-fly Alignment via Priority Rule Following
ACL 2024
A Comprehensive Study of Jailbreak Attack versus Defense for Large Language Models
ACL 2024
UOR: Universal Backdoor Attacks on Pre-trained Language Models
ACL 2024
Expectation Alignment: Handling Reward Misspecification in the Presence of Expectation Mismatch
NIPS 2024
Interpretable Concept Bottlenecks to Align Reinforcement Learning Agents
NIPS 2024
Can I trust You? LLMs as conversational agents
EACL 2024