AI Safety

2972 directly classified papers

Papers

Dynamic Evaluation for Oversensitivity in LLMs EMNLP 2025

“What’s Up, Doc?”: Analyzing How Users Seek Health Information in Large-Scale Conversational AI Datasets EMNLP 2025

Measuring Sycophancy of Language Models in Multi-turn Dialogues EMNLP 2025

Soft Token Attacks Cannot Reliably Audit Unlearning in Large Language Models EMNLP 2025

Exploring the Vulnerability of the Content Moderation Guardrail in Large Language Models via Intent Manipulation EMNLP 2025

The Hallucination Tax of Reinforcement Finetuning EMNLP 2025

Confusion is the Final Barrier: Rethinking Jailbreak Evaluation and Investigating the Real Misuse Threat of LLMs EMNLP 2025

Dagger Behind Smile: Fool LLMs with a Happy Ending Story EMNLP 2025

English as Defense Proxy: Mitigating Multilingual Jailbreak via Eliciting English Safety Knowledge EMNLP 2025

sudoLLM: On Multi-role Alignment of Language Models EMNLP 2025

Beyond Hate Speech: NLP’s Challenges and Opportunities in Uncovering Dehumanizing Language EMNLP 2025

Intrinsic Test of Unlearning Using Parametric Knowledge Traces EMNLP 2025

Anecdoctoring: Automated Red-Teaming Across Language and Place EMNLP 2025

Stimulate the Critical Thinking of LLMs via Debiasing Discussion EMNLP 2025

Unlearning vs. Obfuscation: Are We Truly Removing Knowledge? EMNLP 2025

TopicAttack: An Indirect Prompt Injection Attack via Topic Transition EMNLP 2025

Context-Aware Membership Inference Attacks against Pre-trained Large Language Models EMNLP 2025

Reasoning Model Unlearning: Forgetting Traces, Not Just Answers, While Preserving Reasoning Skills EMNLP 2025

Exploring the Impact of Personality Traits on LLM Bias and Toxicity EMNLP 2025

DSCD: Large Language Model Detoxification with Self-Constrained Decoding EMNLP 2025

Stand on The Shoulders of Giants: Building JailExpert from Previous Attack Experience EMNLP 2025

Large Language Models Have Intrinsic Meta-Cognition, but Need a Good Lens EMNLP 2025

Alignment Quality Index (AQI): Beyond Refusals: AQI as an Intrinsic Alignment Diagnostic via Latent Geometry, Cluster Divergence, and Layer wise Pooled Representations EMNLP 2025

SafeScientist: Enhancing AI Scientist Safety for Risk-Aware Scientific Discovery EMNLP 2025

WebInject: Prompt Injection Attack to Web Agents EMNLP 2025