AI Safety

2972 directly classified papers

Papers

Graph of Attacks with Pruning: Optimizing Stealthy Jailbreak Prompt Generation for Enhanced LLM Content Moderation ACL 2025

Red-Teaming for Uncovering Societal Bias in Large Language Models ACL 2025

Can You Really Trust Code Copilot? Evaluating Large Language Models from a Code Security Perspective ACL 2025

AGD: Adversarial Game Defense Against Jailbreak Attacks in Large Language Models ACL 2025

Text-Diffusion Red-Teaming of Large Language Models: Unveiling Harmful Behaviors with Proximity Constraints AAAI 2025

Faithful and Robust LLM-Driven Theorem Proving for NLI Explanations ACL 2025

On the Robustness of Distributed Machine Learning Against Transfer Attacks AAAI 2025

Revisit Self-Debugging with Self-Generated Tests for Code Generation ACL 2025

Identifying Predictions That Influence the Future: Detecting Performative Concept Drift in Data Streams AAAI 2025

Exploring LLMs’ Ability to Spontaneously and Conditionally Modify Moral Expressions through Text Manipulation ACL 2025

Merger-as-a-Stealer: Stealing Targeted PII from Aligned LLMs with Model Merging EMNLP 2025

Can Indirect Prompt Injection Attacks Be Detected and Removed? ACL 2025

Pragmatic Inference Chain (PIC) Improving LLMs’ Reasoning of Authentic Implicit Toxic Language EMNLP 2025

Defense Against Prompt Injection Attack by Leveraging Attack Techniques ACL 2025

DAMON: A Dialogue-Aware MCTS Framework for Jailbreaking Large Language Models EMNLP 2025

HAF-RM: A Hybrid Alignment Framework for Reward Model Training ACL 2025

Influence-Based Fair Selection for Sample-Discriminative Backdoor Attack AAAI 2025

SConU: Selective Conformal Uncertainty in Large Language Models ACL 2025

EverTracer: Hunting Stolen Large Language Models via Stealthy and Robust Probabilistic Fingerprint EMNLP 2025

From Outcomes to Processes: Guiding PRM Learning from ORM for Inference-Time Alignment ACL 2025

IDEATOR: Jailbreaking and Benchmarking Large Vision-Language Models Using Themselves ICCV 2025

LLMs can be easily Confused by Instructional Distractions ACL 2025

Scalable Surrogate Verification of Image-Based Neural Network Control Systems Using Composition and Unrolling AAAI 2025

Dynamic Evaluation with Cognitive Reasoning for Multi-turn Safety of Large Language Models ACL 2025

LongSafety: Enhance Safety for Long-Context LLMs ACL 2025