AI Safety

2972 directly classified papers

Papers

AGD: Adversarial Game Defense Against Jailbreak Attacks in Large Language Models ACL 2025

Faithful and Robust LLM-Driven Theorem Proving for NLI Explanations ACL 2025

Revisit Self-Debugging with Self-Generated Tests for Code Generation ACL 2025

Exploring LLMs’ Ability to Spontaneously and Conditionally Modify Moral Expressions through Text Manipulation ACL 2025

Can Indirect Prompt Injection Attacks Be Detected and Removed? ACL 2025

Defense Against Prompt Injection Attack by Leveraging Attack Techniques ACL 2025

HAF-RM: A Hybrid Alignment Framework for Reward Model Training ACL 2025

SConU: Selective Conformal Uncertainty in Large Language Models ACL 2025

From Outcomes to Processes: Guiding PRM Learning from ORM for Inference-Time Alignment ACL 2025

LLMs can be easily Confused by Instructional Distractions ACL 2025

Dynamic Evaluation with Cognitive Reasoning for Multi-turn Safety of Large Language Models ACL 2025

Prompt-Guided Internal States for Hallucination Detection of Large Language Models ACL 2025

Towards Fully Exploiting LLM Internal States to Enhance Knowledge Boundary Perception ACL 2025

LLMs know their vulnerabilities: Uncover Safety Gaps through Natural Distribution Shifts ACL 2025

Shaping the Safety Boundaries: Understanding and Defending Against Jailbreaks in Large Language Models ACL 2025

Safety Alignment via Constrained Knowledge Unlearning ACL 2025

Dehumanizing Machines: Mitigating Anthropomorphic Behaviors in Text Generation Systems ACL 2025

Exploring the Impact of Instruction-Tuning on LLM’s Susceptibility to Misinformation ACL 2025

Internal Value Alignment in Large Language Models through Controlled Value Vector Activation ACL 2025

LLMs Caught in the Crossfire: Malware Requests and Jailbreak Challenges ACL 2025

Know Your Mistakes: Towards Preventing Overreliance on Task-Oriented Conversational AI Through Accountability Modeling ACL 2025

SDD: Self-Degraded Defense against Malicious Fine-tuning ACL 2025

A Theory of Response Sampling in LLMs: Part Descriptive and Part Prescriptive ACL 2025

MEraser: An Effective Fingerprint Erasure Approach for Large Language Models ACL 2025

PIGuard: Prompt Injection Guardrail via Mitigating Overdefense for Free ACL 2025