AI Safety

2972 directly classified papers

Papers

Guardians of Trust: Risks and Opportunities for LLMs in Mental Health ACL 2025

What Counts Underlying LLMs’ Moral Dilemma Judgments? ACL 2025

Inherent and emergent liability issues in LLM-based agentic systems: a principal-agent perspective ACL 2025

Safe in Isolation, Dangerous Together: Agent-Driven Multi-Turn Decomposition Jailbreaks on LLMs ACL 2025

iShumei-Chinchunmei at SemEval-2025 Task 4: A balanced forgetting and retention multi-task framework using effective unlearning loss ACL 2025

AILS-NTUA at SemEval-2025 Task 4: Parameter-Efficient Unlearning for Large Language Models using Data Chunking ACL 2025

SemEval-2025 Task 3: Mu-SHROOM, the Multilingual Shared-task on Hallucinations and Related Observable Overgeneration Mistakes ACL 2025

SemEval-2025 Task 4: Unlearning sensitive content from Large Language Models ACL 2025

PROTECT: Policy-Related Organizational Value Taxonomy for Ethical Compliance and Trust ACL 2025

Can LLMs Rank the Harmfulness of Smaller LLMs? We are Not There Yet ACL 2025

QGuard: Question-based Zero-shot Guard for Multi-modal LLM Safety ACL 2025

Detecting Child Objectification on Social Media: Challenges in Language Modeling ACL 2025

Graph of Attacks with Pruning: Optimizing Stealthy Jailbreak Prompt Generation for Enhanced LLM Content Moderation ACL 2025

Red-Teaming for Uncovering Societal Bias in Large Language Models ACL 2025

Scalable Surrogate Verification of Image-Based Neural Network Control Systems Using Composition and Unrolling AAAI 2025

Contrasting Adversarial Perturbations: The Space of Harmless Perturbations AAAI 2025

GIM: A Million-scale Benchmark for Generative Image Manipulation Detection and Localization AAAI 2025

PROSAC: Provably Safe Certification for Machine Learning Models under Adversarial Attacks AAAI 2025

AUTE: Peer-Alignment and Self-Unlearning Boost Adversarial Robustness for Training Ensemble Models AAAI 2025

AIM: Additional Image Guided Generation of Transferable Adversarial Attacks AAAI 2025

Training Verification-Friendly Neural Networks via Neuron Behavior Consistency AAAI 2025

Efficient Robustness Evaluation via Constraint Relaxation AAAI 2025

First Line of Defense: A Robust First Layer Mitigates Adversarial Attacks AAAI 2025

ADBA: Approximation Decision Boundary Approach for Black-Box Adversarial Attacks AAAI 2025

Meme Trojan: Backdoor Attacks Against Hateful Meme Detection via Cross-Modal Triggers AAAI 2025