Research Explorer
AI Safety
2972 directly classified papers
Papers per year
2002: 1
2006: 1
2007: 1
2012: 4
2013: 1
2015: 5
2016: 1
2017: 13
2018: 40
2019: 91
2020: 111
2021: 181
2022: 204
2023: 333
2024: 642
2025: 1031
2026: 312
Papers
A Course Correction in Steerability Evaluation: Revealing Miscalibration and Side Effects in LLMs
AAAI 2026
MetaCipher: A Time-Persistent and Universal Multi-Agent Framework for Cipher-Based Jailbreak Attacks for LLMs
AAAI 2026
Preference Optimization via Contrastive Divergence: Your Policy Is Secretly an NLL Estimator
AAAI 2026
TWINFUZZ: Dual-Model Fuzzing for Robustness Generalization in Deep Learning
AAAI 2026
The Alignment Game: A Theory of Long-Horizon Alignment Through Recursive Curation
AAAI 2026
SMiLE: Provably Enforcing Global Relational Properties in Neural Networks
AAAI 2026
AlignTree: Efficient Defense Against LLM Jailbreak Attacks
AAAI 2026
Unintended Misalignment from Agentic Fine-Tuning: Risks and Mitigation
AAAI 2026
Silenced Biases: The Dark Side LLMs Learned to Refuse
AAAI 2026
Uncovering and Aligning Anomalous Attention Heads to Defend Against NLP Backdoor Attacks
AAAI 2026
K-12EduBench: A Benchmark for Evaluating Large Language Models’ Knowledge, Problem-Solving, and Educational Goal Cognition in K-12 Education
AAAI 2026
AI-Salesman: Towards Reliable Large Language Model Driven Telemarketing
AAAI 2026
Interpretable Reward Model via Sparse Autoencoder
AAAI 2026
ShadeEdit: A Utility-Preserving and Defense-Evasive Knowledge Manipulation Attack in Federated LLMs
AAAI 2026
STaR: Sensitive Trajectory Regulation for Unlearning in Large Reasoning Models
AAAI 2026
ExtendAttack: Attacking Servers of LRMs via Extending Reasoning
AAAI 2026
Failures to Surface Harmful Contents in Video Large Language Models
AAAI 2026
Reference Recommendation Based Membership Inference Attack Against Hybrid-Based Recommender Systems
AAAI 2026
Activation Manipulation Attack: Penetrating and Harmful Jailbreak Attack Against Large Vision-Language Models
AAAI 2026
FILTER: A Framework for Defending Against Backdoor Attacks in Vertical Federated Learning
AAAI 2026
Higher-Order Responsibility
AAAI 2026
SceneJailEval: A Scenario-Adaptive Multi-Dimensional Framework for Jailbreak Evaluation
AAAI 2026
Model-Agnostic Sentiment Distribution Stability Analysis for Robust LLM-Generated Texts Detection
AAAI 2026
IS-Bench: Evaluating Interactive Safety of VLM-Driven Embodied Agents in Daily Household Tasks
AAAI 2026
Learning Vision-Based Neural Network Controllers with Semi-Probabilistic Safety Guarantees
AAAI 2026