AI Safety

2972 directly classified papers

Papers

Dynamic Deep Prompt Optimization for Defending Against Jailbreak Attacks on LLMs AAAI 2026

Efficient Verification and Falsification of ReLU Neural Barrier Certificates AAAI 2026

Probing Semantic Insensitivity for Inference-Time Backdoor Defense in Multimodal Large Language Model AAAI 2026

Joint-GCG: Unified Gradient-Based Poisoning Attacks on Retrieval-Augmented Generation Systems AAAI 2026

MCPTox: A Benchmark for Tool Poisoning on Real-World MCP Servers AAAI 2026

ConfGuard: A Simple and Effective Backdoor Detection for Large Language Models AAAI 2026

MPMA: Preference Manipulation Attack Against Model Context Protocol AAAI 2026

AntiDote: Bi-level Adversarial Training for Tamper-Resistant LLMs AAAI 2026

Efficient Hallucination Detection: Adaptive Bayesian Estimation of Semantic Entropy with Guided Semantic Exploration AAAI 2026

SafetyReminder: Reviving Delayed Safety Awareness of Vision-Language Models to Defend Against Jailbreak Attacks AAAI 2026

Mitigating Content Effects on Reasoning in Language Models Through Fine-Grained Activation Steering AAAI 2026

When Truth Is Overridden: Uncovering the Internal Origins of Sycophancy in Large Language Models AAAI 2026

Hallucination as a Computational Boundary: A Hierarchy of Inevitability and the Oracle Escape AAAI 2026

Multi-Value Alignment for LLMs via Value Decorrelation and Extrapolation AAAI 2026

Benchmarking and Enhancing Rule Knowledge-Driven Reasoning of Large Language Models AAAI 2026

Test-time Prompt Intervention AAAI 2026

PrivSV: Differentially Private Steering Vector for Large Language Models AAAI 2026

ShieldRAG: Safeguarding Retrieval-Augmented Generation from Untrusted Knowledge Bases AAAI 2026

AsFT: Anchoring Safety During LLM Fine-Tuning Within Narrow Safety Basin AAAI 2026

SPA: Achieving Consensus in LLM Alignment via Self-Priority Optimization AAAI 2026

Bootstrapping LLMs via Preference-Based Policy Optimization AAAI 2026

Backdooring Rationalization AAAI 2026

Reinforce Trustworthiness in Multimodal Emotional Support System AAAI 2026

ENCORE: Entropy-guided Reward Composition for Multi-head Safety Reward Models AAAI 2026

LoopLLM: Transferable Energy-Latency Attacks in LLMs via Repetitive Generation AAAI 2026