Research Explorer
AI Safety
2972 directly classified papers
Papers per year
2002: 1
2006: 1
2007: 1
2012: 4
2013: 1
2015: 5
2016: 1
2017: 13
2018: 40
2019: 91
2020: 111
2021: 181
2022: 204
2023: 333
2024: 642
2025: 1031
2026: 312
Papers
Safe RAG by RAG: Untying the Bell That RAG Rang with the RAG Hand
AAAI 2026
Query-Routed Activation Editing with Truth-hierarchical Preference Optimization
AAAI 2026
Hidden in the Noise: Unveiling Backdoors in Audio LLMs Alignment Through Latent Acoustic Pattern Triggers
AAAI 2026
SafeNLIDB: A Privacy-Preserving Safety Alignment Framework for LLM-based Natural Language Database Interfaces
AAAI 2026
BadThink: Triggered Overthinking Attacks on Chain-of-Thought Reasoning in Large Language Models
AAAI 2026
Hallucinate Less by Thinking More: Aspect-Based Causal Abstention for Large Language Models
AAAI 2026
WALKSAFE: Risk-aware Graph Random Walk with Bi-GRPO for LLM Safety
AAAI 2026
SOM Directions Are Better than One: Multi-Directional Refusal Suppression in Language Models
AAAI 2026
MAJIC: Markovian Adaptive Jailbreaking via Iterative Composition of Diverse Innovative Strategies
AAAI 2026
Mental Model-based Generation of Lies for Insider Threat Modeling
AAAI 2026
W2S-AlignTree: Weak-to-Strong Inference-Time Alignment for Large Language Models via Monte Carlo Tree Search
AAAI 2026
Control Illusion: The Failure of Instruction Hierarchies in Large Language Models
AAAI 2026
FaithLM: Towards Faithful Explanations for Large Language Models
EACL 2026
DUP: Detection-guided Unlearning for Backdoor Purification in Language Models
AAAI 2026
Model Editing as a Double-Edged Sword: Steering Agent Behavior Toward Beneficence or Harm
AAAI 2026
Breaking the Stealth-Potency Trade-off in Clean-Image Backdoors with Generative Trigger Optimization
AAAI 2026
Proactive Constrained Policy Optimization with Preemptive Penalty
AAAI 2026
Yours or Mine? Overwriting Attacks Against Neural Audio Watermarking
AAAI 2026
VeriFlow: Modeling Distributions for Neural Network Verification
AAAI 2026
Vulnerability-Aware Robust Multimodal Adversarial Training
AAAI 2026
Beyond Training-time Poisoning: Component-level and Post-training Backdoors in Deep Reinforcement Learning
AAAI 2026
Boosting the Robustness-Accuracy Trade-off of SNNs by Robust Temporal Self-Ensemble
AAAI 2026
Dormant Backdoor: Weaponizing Model Finetuning for Feasible Backdoor Attacks Against Pretrained Models
AAAI 2026
High Dimensional Distributed Gradient Descent with Arbitrary Number of Byzantine Attackers
AAAI 2026
Towards Effective, Stealthy, and Persistent Backdoor Attacks Targeting Graph Foundation Models
AAAI 2026