AI Safety

2972 directly classified papers

Papers

Fairness Shields: Safeguarding against Biased Decision Makers AAAI 2025

Probabilistic Shielding for Safe Reinforcement Learning AAAI 2025

Tilted Quantile Gradient Updates for Quantile-Constrained Reinforcement Learning AAAI 2025

Stop Diverse OOD Attacks: Knowledge Ensemble for Reliable Defense AAAI 2025

The Partially Observable Off-Switch Game AAAI 2025

Multi-Agent Security Tax: Trading Off Security and Collaboration Capabilities in Multi-Agent Systems AAAI 2025

SEAL: Systematic Error Analysis for Value ALignment AAAI 2025

ME: Modelling Ethical Values for Value Alignment AAAI 2025

Leveraging Human Input to Enable Robust, Interactive, and Aligned AI Systems AAAI 2025

Axioms for AI Alignment from Human Feedback AAAI 2025

An Evolutionary Perspective on AI Alignment (Student Abstract) AAAI 2025

RESF: Regularized-Entropy-Sensitive Fingerprinting for Black-Box Tamper Detection of Large Language Models EMNLP 2025

SDGO: Self-Discrimination-Guided Optimization for Consistent Safety in Large Language Models EMNLP 2025

TombRaider: Entering the Vault of History to Jailbreak Large Language Models EMNLP 2025

SEPS: A Separability Measure for Robust Unlearning in LLMs EMNLP 2025

MoSEs: Uncertainty-Aware AI-Generated Text Detection via Mixture of Stylistics Experts with Conditional Thresholds EMNLP 2025

Merger-as-a-Stealer: Stealing Targeted PII from Aligned LLMs with Model Merging EMNLP 2025

Pragmatic Inference Chain (PIC) Improving LLMs’ Reasoning of Authentic Implicit Toxic Language EMNLP 2025

Detoxifying Large Language Models via the Diversity of Toxic Samples EMNLP 2025

VisCRA: A Visual Chain Reasoning Attack for Jailbreaking Multimodal Large Language Models EMNLP 2025

Rethinking Backdoor Detection Evaluation for Language Models EMNLP 2025

DAMON: A Dialogue-Aware MCTS Framework for Jailbreaking Large Language Models EMNLP 2025

EverTracer: Hunting Stolen Large Language Models via Stealthy and Robust Probabilistic Fingerprint EMNLP 2025

Addressing Tokenization Inconsistency in Steganography and Watermarking Based on Large Language Models EMNLP 2025

Attacks by Content: Automated Fact-checking is an AI Security Issue EMNLP 2025