Research Explorer
AI Safety
2972 directly classified papers
Papers per year
2002: 1
2006: 1
2007: 1
2012: 4
2013: 1
2015: 5
2016: 1
2017: 13
2018: 40
2019: 91
2020: 111
2021: 181
2022: 204
2023: 333
2024: 642
2025: 1031
2026: 312
Papers
PrivSV: Differentially Private Steering Vector for Large Language Models
AAAI 2026
K-12EduBench: A Benchmark for Evaluating Large Language Models’ Knowledge, Problem-Solving, and Educational Goal Cognition in K-12 Education
AAAI 2026
AI-Salesman: Towards Reliable Large Language Model Driven Telemarketing
AAAI 2026
Safety Alignment of Large Language Models via Contrasting Safe and Harmful Distributions
AAAI 2026
ShadeEdit: A Utility-Preserving and Defense-Evasive Knowledge Manipulation Attack in Federated LLMs
AAAI 2026
Failures to Surface Harmful Contents in Video Large Language Models
AAAI 2026
Reference Recommendation Based Membership Inference Attack Against Hybrid-Based Recommender Systems
AAAI 2026
MedOmni-45°: A Safety–Performance Benchmark for Reasoning-Oriented LLMs in Medicine
AAAI 2026
SceneJailEval: A Scenario-Adaptive Multi-Dimensional Framework for Jailbreak Evaluation
AAAI 2026
Probing Semantic Insensitivity for Inference-Time Backdoor Defense in Multimodal Large Language Model
AAAI 2026
Joint-GCG: Unified Gradient-Based Poisoning Attacks on Retrieval-Augmented Generation Systems
AAAI 2026
MPMA: Preference Manipulation Attack Against Model Context Protocol
AAAI 2026
The Emotional Baby Is Truly Deadly: Does Your Multimodal Large Reasoning Model Have Emotional Flattery Towards Humans?
AAAI 2026
Silenced Biases: The Dark Side LLMs Learned to Refuse
AAAI 2026
Uncovering and Aligning Anomalous Attention Heads to Defend Against NLP Backdoor Attacks
AAAI 2026
Cost-Minimized Label-Flipping Poisoning Attack to LLM Alignment
AAAI 2026
Dropouts in Confidence: Moral Uncertainty in Human-LLM Alignment
AAAI 2026
Mitigating Self-Preference by Authorship Obfuscation
AAAI 2026
LieCraft: A Multi-Agent Framework for Evaluating Deceptive Capabilities in Language Models
AAAI 2026
Beyond I’m Sorry, I Can’t: Dissecting Large-Language-Model Refusal
AAAI 2026
Chain-of-Thought Driven Adversarial Scenario Extrapolation for Robust Language Models
AAAI 2026
Beyond Verdicts: Evaluating Language Model Moral Competence
AAAI 2026
Persistent Instability in LLM’s Personality Measurements: Effects of Scale, Reasoning, and Conversation History
AAAI 2026
Benchmarking Trustworthiness in Multimodal LLMs for Video Understanding
AAAI 2026
When Proxy Agents Disagree, Do Humans Mirror? Manipulating Human Behavior in Moral Dilemmas Through Agents
AAAI 2026