Research Explorer
AI Safety
2972 directly classified papers
Papers per year
2002: 1
2006: 1
2007: 1
2012: 4
2013: 1
2015: 5
2016: 1
2017: 13
2018: 40
2019: 91
2020: 111
2021: 181
2022: 204
2023: 333
2024: 642
2025: 1031
2026: 312
Papers
Dynamic Evaluation for Oversensitivity in LLMs
EMNLP 2025
“What’s Up, Doc?”: Analyzing How Users Seek Health Information in Large-Scale Conversational AI Datasets
EMNLP 2025
Measuring Sycophancy of Language Models in Multi-turn Dialogues
EMNLP 2025
Soft Token Attacks Cannot Reliably Audit Unlearning in Large Language Models
EMNLP 2025
Exploring the Vulnerability of the Content Moderation Guardrail in Large Language Models via Intent Manipulation
EMNLP 2025
The Hallucination Tax of Reinforcement Finetuning
EMNLP 2025
Confusion is the Final Barrier: Rethinking Jailbreak Evaluation and Investigating the Real Misuse Threat of LLMs
EMNLP 2025
Dagger Behind Smile: Fool LLMs with a Happy Ending Story
EMNLP 2025
English as Defense Proxy: Mitigating Multilingual Jailbreak via Eliciting English Safety Knowledge
EMNLP 2025
sudoLLM: On Multi-role Alignment of Language Models
EMNLP 2025
Beyond Hate Speech: NLP’s Challenges and Opportunities in Uncovering Dehumanizing Language
EMNLP 2025
Intrinsic Test of Unlearning Using Parametric Knowledge Traces
EMNLP 2025
Anecdoctoring: Automated Red-Teaming Across Language and Place
EMNLP 2025
Stimulate the Critical Thinking of LLMs via Debiasing Discussion
EMNLP 2025
Unlearning vs. Obfuscation: Are We Truly Removing Knowledge?
EMNLP 2025
TopicAttack: An Indirect Prompt Injection Attack via Topic Transition
EMNLP 2025
Context-Aware Membership Inference Attacks against Pre-trained Large Language Models
EMNLP 2025
Reasoning Model Unlearning: Forgetting Traces, Not Just Answers, While Preserving Reasoning Skills
EMNLP 2025
Exploring the Impact of Personality Traits on LLM Bias and Toxicity
EMNLP 2025
DSCD: Large Language Model Detoxification with Self-Constrained Decoding
EMNLP 2025
Stand on The Shoulders of Giants: Building JailExpert from Previous Attack Experience
EMNLP 2025
Large Language Models Have Intrinsic Meta-Cognition, but Need a Good Lens
EMNLP 2025
Alignment Quality Index (AQI) : Beyond Refusals: AQI as an Intrinsic Alignment Diagnostic via Latent Geometry, Cluster Divergence, and Layer wise Pooled Representations
EMNLP 2025
SafeScientist: Enhancing AI Scientist Safety for Risk-Aware Scientific Discovery
EMNLP 2025
WebInject: Prompt Injection Attack to Web Agents
EMNLP 2025