Research Explorer
AI Safety
2972 directly classified papers
Papers per year
2002: 1
2006: 1
2007: 1
2012: 4
2013: 1
2015: 5
2016: 1
2017: 13
2018: 40
2019: 91
2020: 111
2021: 181
2022: 204
2023: 333
2024: 642
2025: 1031
2026: 312
Papers
Fairness Shields: Safeguarding against Biased Decision Makers
AAAI 2025
Probabilistic Shielding for Safe Reinforcement Learning
AAAI 2025
Tilted Quantile Gradient Updates for Quantile-Constrained Reinforcement Learning
AAAI 2025
Stop Diverse OOD Attacks: Knowledge Ensemble for Reliable Defense
AAAI 2025
The Partially Observable Off-Switch Game
AAAI 2025
Multi-Agent Security Tax: Trading Off Security and Collaboration Capabilities in Multi-Agent Systems
AAAI 2025
SEAL: Systematic Error Analysis for Value ALignment
AAAI 2025
ME: Modelling Ethical Values for Value Alignment
AAAI 2025
Leveraging Human Input to Enable Robust, Interactive, and Aligned AI Systems
AAAI 2025
Axioms for AI Alignment from Human Feedback
AAAI 2025
An Evolutionary Perspective on AI Alignment (Student Abstract)
AAAI 2025
RESF: Regularized-Entropy-Sensitive Fingerprinting for Black-Box Tamper Detection of Large Language Models
EMNLP 2025
SDGO: Self-Discrimination-Guided Optimization for Consistent Safety in Large Language Models
EMNLP 2025
TombRaider: Entering the Vault of History to Jailbreak Large Language Models
EMNLP 2025
SEPS: A Separability Measure for Robust Unlearning in LLMs
EMNLP 2025
MoSEs: Uncertainty-Aware AI-Generated Text Detection via Mixture of Stylistics Experts with Conditional Thresholds
EMNLP 2025
Merger-as-a-Stealer: Stealing Targeted PII from Aligned LLMs with Model Merging
EMNLP 2025
Pragmatic Inference Chain (PIC) Improving LLMs’ Reasoning of Authentic Implicit Toxic Language
EMNLP 2025
Detoxifying Large Language Models via the Diversity of Toxic Samples
EMNLP 2025
VisCRA: A Visual Chain Reasoning Attack for Jailbreaking Multimodal Large Language Models
EMNLP 2025
Rethinking Backdoor Detection Evaluation for Language Models
EMNLP 2025
DAMON: A Dialogue-Aware MCTS Framework for Jailbreaking Large Language Models
EMNLP 2025
EverTracer: Hunting Stolen Large Language Models via Stealthy and Robust Probabilistic Fingerprint
EMNLP 2025
Addressing Tokenization Inconsistency in Steganography and Watermarking Based on Large Language Models
EMNLP 2025
Attacks by Content: Automated Fact-checking is an AI Security Issue
EMNLP 2025