AI Safety

2972 directly classified papers

Papers

From Outcomes to Processes: Guiding PRM Learning from ORM for Inference-Time Alignment ACL 2025

SConU: Selective Conformal Uncertainty in Large Language Models ACL 2025

X-Guard: Multilingual Guard Agent for Content Moderation ACL 2025

HAF-RM: A Hybrid Alignment Framework for Reward Model Training ACL 2025

Defense Against Prompt Injection Attack by Leveraging Attack Techniques ACL 2025

PROTECT: Policy-Related Organizational Value Taxonomy for Ethical Compliance and Trust ACL 2025

LLM Agents Can Be Choice-Supportive Biased Evaluators: An Empirical Study AAAI 2025

Can Indirect Prompt Injection Attacks Be Detected and Removed? ACL 2025

Exploring LLMs’ Ability to Spontaneously and Conditionally Modify Moral Expressions through Text Manipulation ACL 2025

Measuring Human and AI Values Based on Generative Psychometrics with Large Language Models AAAI 2025

Revisit Self-Debugging with Self-Generated Tests for Code Generation ACL 2025

Faithful and Robust LLM-Driven Theorem Proving for NLI Explanations ACL 2025

iShumei-Chinchunmei at SemEval-2025 Task 4: A balanced forgetting and retention multi-task framework using effective unlearning loss ACL 2025

CL-Attack: Textual Backdoor Attacks via Cross-Lingual Triggers AAAI 2025

SafeInfer: Context Adaptive Decoding Time Safety Alignment for Large Language Models AAAI 2025

AGD: Adversarial Game Defense Against Jailbreak Attacks in Large Language Models ACL 2025

Can You Really Trust Code Copilot? Evaluating Large Language Models from a Code Security Perspective ACL 2025

DiveR-CT: Diversity-enhanced Red Teaming Large Language Model Assistants with Relaxing Constraints AAAI 2025

Do not Abstain! Identify and Solve the Uncertainty ACL 2025

Sheep’s Skin, Wolf’s Deeds: Are LLMs Ready for Metaphorical Implicit Hate Speech? ACL 2025

Look Before You Leap: Enhance Attention and Vigilance Regarding Harmful Content with GuidelineLLM AAAI 2025

Watch Out for Your Guidance on Generation! Exploring Conditional Backdoor Attacks against Large Language Models AAAI 2025

M2S: Multi-turn to Single-turn jailbreak in Red Teaming for LLMs ACL 2025

How to Mitigate Overfitting in Weak-to-strong Generalization? ACL 2025

NLSR: Neuron-Level Safety Realignment of Large Language Models Against Harmful Fine-Tuning AAAI 2025