AI Safety

2972 directly classified papers

Papers

A Course Correction in Steerability Evaluation: Revealing Miscalibration and Side Effects in LLMs AAAI 2026

MetaCipher: A Time-Persistent and Universal Multi-Agent Framework for Cipher-Based Jailbreak Attacks for LLMs AAAI 2026

Preference Optimization via Contrastive Divergence: Your Policy Is Secretly an NLL Estimator AAAI 2026

TWINFUZZ: Dual-Model Fuzzing for Robustness Generalization in Deep Learning AAAI 2026

The Alignment Game: A Theory of Long-Horizon Alignment Through Recursive Curation AAAI 2026

SMiLE: Provably Enforcing Global Relational Properties in Neural Networks AAAI 2026

AlignTree: Efficient Defense Against LLM Jailbreak Attacks AAAI 2026

Unintended Misalignment from Agentic Fine-Tuning: Risks and Mitigation AAAI 2026

Silenced Biases: The Dark Side LLMs Learned to Refuse AAAI 2026

Uncovering and Aligning Anomalous Attention Heads to Defend Against NLP Backdoor Attacks AAAI 2026

K-12EduBench: A Benchmark for Evaluating Large Language Models’ Knowledge, Problem-Solving, and Educational Goal Cognition in K-12 Education AAAI 2026

AI-Salesman: Towards Reliable Large Language Model Driven Telemarketing AAAI 2026

Interpretable Reward Model via Sparse Autoencoder AAAI 2026

ShadeEdit: A Utility-Preserving and Defense-Evasive Knowledge Manipulation Attack in Federated LLMs AAAI 2026

STaR: Sensitive Trajectory Regulation for Unlearning in Large Reasoning Models AAAI 2026

ExtendAttack: Attacking Servers of LRMs via Extending Reasoning AAAI 2026

Failures to Surface Harmful Contents in Video Large Language Models AAAI 2026

Reference Recommendation Based Membership Inference Attack Against Hybrid-Based Recommender Systems AAAI 2026

Activation Manipulation Attack: Penetrating and Harmful Jailbreak Attack Against Large Vision-Language Models AAAI 2026

FILTER: A Framework for Defending Against Backdoor Attacks in Vertical Federated Learning AAAI 2026

Higher-Order Responsibility AAAI 2026

SceneJailEval: A Scenario-Adaptive Multi-Dimensional Framework for Jailbreak Evaluation AAAI 2026

Model-Agnostic Sentiment Distribution Stability Analysis for Robust LLM-Generated Texts Detection AAAI 2026

IS-Bench: Evaluating Interactive Safety of VLM-Driven Embodied Agents in Daily Household Tasks AAAI 2026

Learning Vision-Based Neural Network Controllers with Semi-Probabilistic Safety Guarantees AAAI 2026