AI Safety

2972 directly classified papers

Papers

Safe RAG by RAG: Untying the Bell That RAG Rang with the RAG Hand AAAI 2026

Query-Routed Activation Editing with Truth-hierarchical Preference Optimization AAAI 2026

Hidden in the Noise: Unveiling Backdoors in Audio LLMs Alignment Through Latent Acoustic Pattern Triggers AAAI 2026

SafeNLIDB: A Privacy-Preserving Safety Alignment Framework for LLM-based Natural Language Database Interfaces AAAI 2026

BadThink: Triggered Overthinking Attacks on Chain-of-Thought Reasoning in Large Language Models AAAI 2026

Hallucinate Less by Thinking More: Aspect-Based Causal Abstention for Large Language Models AAAI 2026

WALKSAFE: Risk-aware Graph Random Walk with Bi-GRPO for LLM Safety AAAI 2026

SOM Directions Are Better than One: Multi-Directional Refusal Suppression in Language Models AAAI 2026

MAJIC: Markovian Adaptive Jailbreaking via Iterative Composition of Diverse Innovative Strategies AAAI 2026

Mental Model-based Generation of Lies for Insider Threat Modeling AAAI 2026

W2S-AlignTree: Weak-to-Strong Inference-Time Alignment for Large Language Models via Monte Carlo Tree Search AAAI 2026

Control Illusion: The Failure of Instruction Hierarchies in Large Language Models AAAI 2026

FaithLM: Towards Faithful Explanations for Large Language Models EACL 2026

DUP: Detection-guided Unlearning for Backdoor Purification in Language Models AAAI 2026

Model Editing as a Double-Edged Sword: Steering Agent Behavior Toward Beneficence or Harm AAAI 2026

Breaking the Stealth-Potency Trade-off in Clean-Image Backdoors with Generative Trigger Optimization AAAI 2026

Proactive Constrained Policy Optimization with Preemptive Penalty AAAI 2026

Yours or Mine? Overwriting Attacks Against Neural Audio Watermarking AAAI 2026

VeriFlow: Modeling Distributions for Neural Network Verification AAAI 2026

Vulnerability-Aware Robust Multimodal Adversarial Training AAAI 2026

Beyond Training-time Poisoning: Component-level and Post-training Backdoors in Deep Reinforcement Learning AAAI 2026

Boosting the Robustness-Accuracy Trade-off of SNNs by Robust Temporal Self-Ensemble AAAI 2026

Dormant Backdoor: Weaponizing Model Finetuning for Feasible Backdoor Attacks Against Pretrained Models AAAI 2026

High Dimensional Distributed Gradient Descent with Arbitrary Number of Byzantine Attackers AAAI 2026

Towards Effective, Stealthy, and Persistent Backdoor Attacks Targeting Graph Foundation Models AAAI 2026