Research Explorer
AI Safety
2972 directly classified papers
Papers per year
2002: 1
2006: 1
2007: 1
2012: 4
2013: 1
2015: 5
2016: 1
2017: 13
2018: 40
2019: 91
2020: 111
2021: 181
2022: 204
2023: 333
2024: 642
2025: 1031
2026: 312
Papers
Dynamic Deep Prompt Optimization for Defending Against Jailbreak Attacks on LLMs
AAAI 2026
Efficient Verification and Falsification of ReLU Neural Barrier Certificates
AAAI 2026
Probing Semantic Insensitivity for Inference-Time Backdoor Defense in Multimodal Large Language Model
AAAI 2026
Joint-GCG: Unified Gradient-Based Poisoning Attacks on Retrieval-Augmented Generation Systems
AAAI 2026
MCPTox: A Benchmark for Tool Poisoning on Real-World MCP Servers
AAAI 2026
ConfGuard: A Simple and Effective Backdoor Detection for Large Language Models
AAAI 2026
MPMA: Preference Manipulation Attack Against Model Context Protocol
AAAI 2026
AntiDote: Bi-level Adversarial Training for Tamper-Resistant LLMs
AAAI 2026
Efficient Hallucination Detection: Adaptive Bayesian Estimation of Semantic Entropy with Guided Semantic Exploration
AAAI 2026
SafetyReminder: Reviving Delayed Safety Awareness of Vision-Language Models to Defend Against Jailbreak Attacks
AAAI 2026
Mitigating Content Effects on Reasoning in Language Models Through Fine-Grained Activation Steering
AAAI 2026
When Truth Is Overridden: Uncovering the Internal Origins of Sycophancy in Large Language Models
AAAI 2026
Hallucination as a Computational Boundary: A Hierarchy of Inevitability and the Oracle Escape
AAAI 2026
Multi-Value Alignment for LLMs via Value Decorrelation and Extrapolation
AAAI 2026
Benchmarking and Enhancing Rule Knowledge-Driven Reasoning of Large Language Models
AAAI 2026
Test-time Prompt Intervention
AAAI 2026
PrivSV: Differentially Private Steering Vector for Large Language Models
AAAI 2026
ShieldRAG: Safeguarding Retrieval-Augmented Generation from Untrusted Knowledge Bases
AAAI 2026
AsFT: Anchoring Safety During LLM Fine-Tuning Within Narrow Safety Basin
AAAI 2026
SPA: Achieving Consensus in LLM Alignment via Self-Priority Optimization
AAAI 2026
Bootstrapping LLMs via Preference-Based Policy Optimization
AAAI 2026
Backdooring Rationalization
AAAI 2026
Reinforce Trustworthiness in Multimodal Emotional Support System
AAAI 2026
ENCORE: Entropy-guided Reward Composition for Multi-head Safety Reward Models
AAAI 2026
LoopLLM: Transferable Energy-Latency Attacks in LLMs via Repetitive Generation
AAAI 2026