AI Safety

2972 directly classified papers

Papers

SAFENUDGE: Safeguarding Large Language Models in Real-time with Tunable Safety-Performance Trade-offs EMNLP 2025

Pointing to a Llama and Call it a Camel: On the Sycophancy of Multimodal Large Language Models EMNLP 2025

Unmasking Fake Careers: Detecting Machine-Generated Career Trajectories via Multi-layer Heterogeneous Graphs EMNLP 2025

Understanding and Mitigating Overrefusal in LLMs from an Unveiling Perspective of Safety Decision Boundary EMNLP 2025

MUSE: MCTS-Driven Red Teaming Framework for Enhanced Multi-Turn Dialogue Safety in Large Language Models EMNLP 2025

“I’ve Decided to Leak”: Probing Internals Behind Prompt Leakage Intents EMNLP 2025

Nullspace Disentanglement for Red Teaming Language Models EMNLP 2025

Investigating How Pre-training Data Leakage Affects Models’ Reproduction and Detection Capabilities EMNLP 2025

NEXUS: Network Exploration for eXploiting Unsafe Sequences in Multi-Turn LLM Jailbreaks EMNLP 2025

Hallucination Detection in LLMs Using Spectral Features of Attention Maps EMNLP 2025

AdaSteer: Your Aligned LLM is Inherently an Adaptive Jailbreak Defender EMNLP 2025

Pierce the Mists, Greet the Sky: Decipher Knowledge Overshadowing via Knowledge Circuit Analysis EMNLP 2025

Harmful Prompt Laundering: Jailbreaking LLMs with Abductive Styles and Symbolic Encoding EMNLP 2025

Evaluating Robustness of Large Audio Language Models to Audio Injection: An Empirical Study EMNLP 2025

Subtle Risks, Critical Failures: A Framework for Diagnosing Physical Safety of LLMs for Embodied Decision Making EMNLP 2025

Leaky Thoughts: Large Reasoning Models Are Not Private Thinkers EMNLP 2025

Model Unlearning via Sparse Autoencoder Subspace Guided Projections EMNLP 2025

TrojanStego: Your Language Model Can Secretly Be A Steganographic Privacy Leaking Agent EMNLP 2025

MrGuard: A Multilingual Reasoning Guardrail for Universal LLM Safety EMNLP 2025

Improving Large Language Model Safety with Contrastive Representation Learning EMNLP 2025

Large Language Models Threaten Language’s Epistemic and Communicative Foundations EMNLP 2025

How to Protect Yourself from 5G Radiation? Investigating LLM Responses to Implicit Misinformation EMNLP 2025

Do LLMs Behave as Claimed? Investigating How LLMs Follow Their Own Claims using Counterfactual Questions EMNLP 2025

How Much Do LLMs Hallucinate across Languages? On Realistic Multilingual Estimation of LLM Hallucination EMNLP 2025

Jailbreak LLMs through Internal Stance Manipulation EMNLP 2025