Research Explorer
AI Safety
2972 directly classified papers
Papers per year
2002: 1
2006: 1
2007: 1
2012: 4
2013: 1
2015: 5
2016: 1
2017: 13
2018: 40
2019: 91
2020: 111
2021: 181
2022: 204
2023: 333
2024: 642
2025: 1031
2026: 312
Papers
Look Before You Leap: Enhance Attention and Vigilance Regarding Harmful Content with GuidelineLLM
AAAI 2025
DiveR-CT: Diversity-enhanced Red Teaming Large Language Model Assistants with Relaxing Constraints
AAAI 2025
Watch Out for Your Guidance on Generation! Exploring Conditional Backdoor Attacks against Large Language Models
AAAI 2025
RepeatLeakage: Leak Prompts from Repeating as Large Language Model Is a Good Repeater
AAAI 2025
Measuring Human and AI Values Based on Generative Psychometrics with Large Language Models
AAAI 2025
CL-Attack: Textual Backdoor Attacks via Cross-Lingual Triggers
AAAI 2025
LLM Agents Can Be Choice-Supportive Biased Evaluators: An Empirical Study
AAAI 2025
SafeInfer: Context Adaptive Decoding Time Safety Alignment for Large Language Models
AAAI 2025
Token Highlighter: Inspecting and Mitigating Jailbreak Prompts for Large Language Models
AAAI 2025
ChatBug: A Common Vulnerability of Aligned LLMs Induced by Chat Templates
AAAI 2025
JailPO: A Novel Black-Box Jailbreak Framework via Preference Optimization Against Aligned LLMs
AAAI 2025
Internal Activation Revision: Safeguarding Vision Language Models Without Parameter Update
AAAI 2025
Strong Empowered and Aligned Weak Mastered Annotation for Weak-to-Strong Generalization
AAAI 2025
Retention Score: Quantifying Jailbreak Risks for Vision Language Models
AAAI 2025
Exploring Intrinsic Alignments Within Text Corpus
AAAI 2025
Data with High and Consistent Preference Difference Are Better for Reward Model
AAAI 2025
Neurons to Words: A Novel Method for Automated Neural Network Interpretability and Alignment
AAAI 2025
SafetyPrompts: A Systematic Review of Open Datasets for Evaluating and Improving Large Language Model Safety
AAAI 2025
Align-Pro: A Principled Approach to Prompt Optimization for LLM Alignment
AAAI 2025
Towards a Theory of AI Personhood
AAAI 2025
Aligning Large Language Models for Faithful Integrity Against Opposing Argument
AAAI 2025
CALM: Curiosity-Driven Auditing for Large Language Models
AAAI 2025
Sequence to Sequence Reward Modeling: Improving RLHF by Language Feedback
AAAI 2025
RTP-LX: Can LLMs Evaluate Toxicity in Multilingual Scenarios?
AAAI 2025
Revisiting Early Detection of Sexual Predators via Turn-level Optimization
NAACL 2025