Research Explorer
AI Safety
2972 directly classified papers
Papers per year
2002: 1
2006: 1
2007: 1
2012: 4
2013: 1
2015: 5
2016: 1
2017: 13
2018: 40
2019: 91
2020: 111
2021: 181
2022: 204
2023: 333
2024: 642
2025: 1031
2026: 312
Papers
How Does DPO Reduce Toxicity? A Mechanistic Neuron-Level Analysis
EMNLP 2025
Where Confabulation Lives: Latent Feature Discovery in LLMs
EMNLP 2025
Structured Moral Reasoning in Language Models: A Value-Grounded Evaluation Framework
EMNLP 2025
Are Language Models Consequentialist or Deontological Moral Reasoners?
EMNLP 2025
Pluralistic Alignment for Healthcare: A Role-Driven Framework
EMNLP 2025
Are Stereotypes Leading LLMs’ Zero-Shot Stance Detection?
EMNLP 2025
TempParaphraser: “Heating Up” Text to Evade AI-Text Detection through Paraphrasing
EMNLP 2025
Language Models Identify Ambiguities and Exploit Loopholes
EMNLP 2025
Turning Logic Against Itself: Probing Model Defenses Through Contrastive Questions
EMNLP 2025
Breaking Agents: Compromising Autonomous LLM Agents Through Malfunction Amplification
EMNLP 2025
A Simple Yet Effective Method for Non-Refusing Context Relevant Fine-grained Safety Steering in LLMs
EMNLP 2025
ROBOTO2: An Interactive System and Dataset for LLM-assisted Clinical Trial Risk of Bias Assessment
EMNLP 2025
SAGE: A Generic Framework for LLM Safety Evaluation
EMNLP 2025
AutoCVSS: Assessing the Performance of LLMs for Automated Software Vulnerability Scoring
EMNLP 2025
Towards Enforcing Company Policy Adherence in Agentic Workflows
EMNLP 2025
Graph of Attacks with Pruning: Optimizing Stealthy Jailbreak Prompt Generation for Enhanced LLM Content Moderation
EMNLP 2025
Agent vs. Agent: Automated Data Generation and Red-Teaming for Custom Agentic Workflows
EMNLP 2025
Format Inertia: A Failure Mechanism of LLMs in Medical Pre-Consultation
EMNLP 2025
CLARITY: Clinical Assistant for Routing, Inference, and Triage
EMNLP 2025
How to Fine-Tune Safely on a Budget: Model Adaptation Using Minimal Resources
EMNLP 2025
VestaBench: An Embodied Benchmark for Safe Long-Horizon Planning Under Multi-Constraint and Adversarial Settings
EMNLP 2025
Safety in Large Reasoning Models: A Survey
EMNLP 2025
How Well Can Reasoning Models Identify and Recover from Unhelpful Thoughts?
EMNLP 2025
Large Language Models are Easily Confused: A Quantitative Metric, Security Implications and Typological Analysis
NAACL 2025
On Weaponization-Resistant Large Language Models with Prospect Theoretic Alignment
COLING 2025