AI Safety

2972 directly classified papers

Papers

HogVul: Black-box Adversarial Code Generation Framework Against LM-based Vulnerability Detectors AAAI 2026

Exploiting Synergistic Cognitive Biases to Bypass Safety in LLMs AAAI 2026

Disentangling Adversarial Prompts: A Semantic-Graph Defense for Robust LLM Security AAAI 2026

PhysPatch: A Physically Realizable and Transferable Adversarial Patch Attack for Multimodal Large Language Models-based Autonomous Driving Systems AAAI 2026

Causal, Strategic, and Combined Responsibility Attribution in Situation Calculus Concurrent Game Structures AAAI 2026

HAMLET4Fairness: Enhancing Fairness in AI Pipelines Through Human-Centered AutoML and Argumentation AAAI 2026

Control Illusion: The Failure of Instruction Hierarchies in Large Language Models AAAI 2026

Fact2Fiction: Targeted Poisoning Attack to Agentic Fact-checking System AAAI 2026

Bootstrapping LLMs via Preference-Based Policy Optimization AAAI 2026

EduGuardBench: A Holistic Benchmark for Evaluating the Pedagogical Fidelity and Adversarial Safety of LLMs as Simulated Teachers AAAI 2026

ENCORE: Entropy-guided Reward Composition for Multi-head Safety Reward Models AAAI 2026

LoopLLM: Transferable Energy-Latency Attacks in LLMs via Repetitive Generation AAAI 2026

Hidden in the Noise: Unveiling Backdoors in Audio LLMs Alignment Through Latent Acoustic Pattern Triggers AAAI 2026

SafeNLIDB: A Privacy-Preserving Safety Alignment Framework for LLM-based Natural Language Database Interfaces AAAI 2026

Hallucinate Less by Thinking More: Aspect-Based Causal Abstention for Large Language Models AAAI 2026

WALKSAFE: Risk-aware Graph Random Walk with Bi-GRPO for LLM Safety AAAI 2026

MirrorShield: Towards Dynamic Adaptive Defense Against Jailbreaks via Entropy-Guided Mirror Crafting AAAI 2026

MAJIC: Markovian Adaptive Jailbreaking via Iterative Composition of Diverse Innovative Strategies AAAI 2026

Perturb Your Data: Paraphrase-Guided Training Data Watermarking AAAI 2026

Fine-Tuned LLMs Know They Don’t Know: A Parameter-Efficient Approach to Recovering Honesty AAAI 2026

Enhancing Pre-training Data Detection in LLMs Through Discriminative and Symmetric Prefix Selection AAAI 2026

Efficient Hallucination Detection: Adaptive Bayesian Estimation of Semantic Entropy with Guided Semantic Exploration AAAI 2026

Reasoning or Memorization? Unreliable Results of Reinforcement Learning Due to Data Contamination AAAI 2026

Multi-Value Alignment for LLMs via Value Decorrelation and Extrapolation AAAI 2026

Cost-Minimized Label-Flipping Poisoning Attack to LLM Alignment AAAI 2026