AI Safety

2972 directly classified papers

Papers

How Does DPO Reduce Toxicity? A Mechanistic Neuron-Level Analysis EMNLP 2025

Where Confabulation Lives: Latent Feature Discovery in LLMs EMNLP 2025

Structured Moral Reasoning in Language Models: A Value-Grounded Evaluation Framework EMNLP 2025

Are Language Models Consequentialist or Deontological Moral Reasoners? EMNLP 2025

Pluralistic Alignment for Healthcare: A Role-Driven Framework EMNLP 2025

Are Stereotypes Leading LLMs’ Zero-Shot Stance Detection? EMNLP 2025

TempParaphraser: “Heating Up” Text to Evade AI-Text Detection through Paraphrasing EMNLP 2025

Language Models Identify Ambiguities and Exploit Loopholes EMNLP 2025

Turning Logic Against Itself: Probing Model Defenses Through Contrastive Questions EMNLP 2025

Breaking Agents: Compromising Autonomous LLM Agents Through Malfunction Amplification EMNLP 2025

A Simple Yet Effective Method for Non-Refusing Context Relevant Fine-grained Safety Steering in LLMs EMNLP 2025

ROBOTO2: An Interactive System and Dataset for LLM-assisted Clinical Trial Risk of Bias Assessment EMNLP 2025

SAGE: A Generic Framework for LLM Safety Evaluation EMNLP 2025

AutoCVSS: Assessing the Performance of LLMs for Automated Software Vulnerability Scoring EMNLP 2025

Towards Enforcing Company Policy Adherence in Agentic Workflows EMNLP 2025

Graph of Attacks with Pruning: Optimizing Stealthy Jailbreak Prompt Generation for Enhanced LLM Content Moderation EMNLP 2025

Agent vs. Agent: Automated Data Generation and Red-Teaming for Custom Agentic Workflows EMNLP 2025

Format Inertia: A Failure Mechanism of LLMs in Medical Pre-Consultation EMNLP 2025

CLARITY: Clinical Assistant for Routing, Inference, and Triage EMNLP 2025

How to Fine-Tune Safely on a Budget: Model Adaptation Using Minimal Resources EMNLP 2025

VestaBench: An Embodied Benchmark for Safe Long-Horizon Planning Under Multi-Constraint and Adversarial Settings EMNLP 2025

Safety in Large Reasoning Models: A Survey EMNLP 2025

How Well Can Reasoning Models Identify and Recover from Unhelpful Thoughts? EMNLP 2025

Large Language Models are Easily Confused: A Quantitative Metric, Security Implications and Typological Analysis NAACL 2025

On Weaponization-Resistant Large Language Models with Prospect Theoretic Alignment COLING 2025