Research Explorer
AI Safety
2972 directly classified papers
Papers per year
2002: 1
2006: 1
2007: 1
2012: 4
2013: 1
2015: 5
2016: 1
2017: 13
2018: 40
2019: 91
2020: 111
2021: 181
2022: 204
2023: 333
2024: 642
2025: 1031
2026: 312
Papers
SAFENUDGE: Safeguarding Large Language Models in Real-time with Tunable Safety-Performance Trade-offs
EMNLP 2025
Pointing to a Llama and Call it a Camel: On the Sycophancy of Multimodal Large Language Models
EMNLP 2025
Unmasking Fake Careers: Detecting Machine-Generated Career Trajectories via Multi-layer Heterogeneous Graphs
EMNLP 2025
Understanding and Mitigating Overrefusal in LLMs from an Unveiling Perspective of Safety Decision Boundary
EMNLP 2025
MUSE: MCTS-Driven Red Teaming Framework for Enhanced Multi-Turn Dialogue Safety in Large Language Models
EMNLP 2025
“I’ve Decided to Leak”: Probing Internals Behind Prompt Leakage Intents
EMNLP 2025
Nullspace Disentanglement for Red Teaming Language Models
EMNLP 2025
Investigating How Pre-training Data Leakage Affects Models’ Reproduction and Detection Capabilities
EMNLP 2025
NEXUS: Network Exploration for eXploiting Unsafe Sequences in Multi-Turn LLM Jailbreaks
EMNLP 2025
Hallucination Detection in LLMs Using Spectral Features of Attention Maps
EMNLP 2025
AdaSteer: Your Aligned LLM is Inherently an Adaptive Jailbreak Defender
EMNLP 2025
Pierce the Mists, Greet the Sky: Decipher Knowledge Overshadowing via Knowledge Circuit Analysis
EMNLP 2025
Harmful Prompt Laundering: Jailbreaking LLMs with Abductive Styles and Symbolic Encoding
EMNLP 2025
Evaluating Robustness of Large Audio Language Models to Audio Injection: An Empirical Study
EMNLP 2025
Subtle Risks, Critical Failures: A Framework for Diagnosing Physical Safety of LLMs for Embodied Decision Making
EMNLP 2025
Leaky Thoughts: Large Reasoning Models Are Not Private Thinkers
EMNLP 2025
Model Unlearning via Sparse Autoencoder Subspace Guided Projections
EMNLP 2025
TrojanStego: Your Language Model Can Secretly Be A Steganographic Privacy Leaking Agent
EMNLP 2025
MrGuard: A Multilingual Reasoning Guardrail for Universal LLM Safety
EMNLP 2025
Improving Large Language Model Safety with Contrastive Representation Learning
EMNLP 2025
Large Language Models Threaten Language’s Epistemic and Communicative Foundations
EMNLP 2025
How to Protect Yourself from 5G Radiation? Investigating LLM Responses to Implicit Misinformation
EMNLP 2025
Do LLMs Behave as Claimed? Investigating How LLMs Follow Their Own Claims using Counterfactual Questions
EMNLP 2025
How Much Do LLMs Hallucinate across Languages? On Realistic Multilingual Estimation of LLM Hallucination
EMNLP 2025
Jailbreak LLMs through Internal Stance Manipulation
EMNLP 2025