AI Safety

2972 directly classified papers

Papers

Intrinsic Barriers and Practical Pathways for Human–AI Alignment: An Agreement-Based Complexity Analysis AAAI 2026

Realist and Pluralist Conceptions of Intelligence and Their Implications on AI Research AAAI 2026

AdvBDGen: A Robust Framework for Generating Adaptive and Stealthy Backdoors in LLM Alignment AAAI 2026

Beyond I’m Sorry, I Can’t: Dissecting Large-Language-Model Refusal AAAI 2026

Chain-of-Thought Driven Adversarial Scenario Extrapolation for Robust Language Models AAAI 2026

Confirmation Bias: A Challenge for Scalable Oversight AAAI 2026

Detecting Compute Structuring in AI Governance Is Likely Feasible AAAI 2026

Efficient Switchable Safety Control in LLMs via Magic-Token-Guided Co-Training AAAI 2026

Shadows in the Code: Exploring the Risks and Defenses of LLM-based Multi-Agent Software Development Systems AAAI 2026

Safe Multi-agent Reinforcement Learning with Natural Language Constraints AAAI 2026

Designing Incident Reporting Systems for Harms from General-Purpose AI AAAI 2026

HumorReject: Decoupling LLM Safety from Refusal Prefix via a Little Humor AAAI 2026

When Human Preferences Flip: An Instance-Dependent Robust Loss for RLHF AAAI 2026

Multi-Faceted Attack: Exposing Cross-Model Vulnerabilities in Defense-Equipped Vision-Language Models AAAI 2026

Composable Assurance for AI Alignment: A Framework for Propagating Formal Safety Properties Through MLOps AAAI 2026

When Proxy Agents Disagree, Do Humans Mirror? Manipulating Human Behavior in Moral Dilemmas Through Agents AAAI 2026

Beta Distribution Learning for Reliable Roadway Crash Risk Assessment AAAI 2026

MHB: Medical Hallucination Benchmark for Large Language Models in Complex Clinical Tasks AAAI 2026

Should You Use LLMs to Simulate Opinions? Quality Checks for Early-Stage Deliberation AAAI 2026

Hashed Watermark as a Filter: A Unified Defense Against Forging and Overwriting Attacks in Neural Network Watermarking AAAI 2026

Reason2Attack: Jailbreaking Text-to-Image Models via LLM Reasoning AAAI 2026

Consensus Learning with Multi-Party Perturbation Triggers for Secure Model Access AAAI 2026

Probabilistic Safety Verification of Neural Policies via Predicate Abstraction AAAI 2026

AURA: Affordance-Understanding and Risk-aware Alignment Technique for Large Language Models AAAI 2026

MoralReason: Generalizable Moral Decision Alignment for LLM Agents Using Reasoning-Level Reinforcement Learning AAAI 2026