AI Safety

2972 directly classified papers

Papers

Exploiting the Shadows: Unveiling Privacy Leaks through Lower-Ranked Tokens in Large Language Models ACL 2025

Adversarial Robust Memory-Based Continual Learner ICCV 2025

Gradient-Reweighted Adversarial Camouflage for Physical Object Detection Evasion ICCV 2025

VLSBench: Unveiling Visual Leakage in Multimodal Safety ACL 2025

Are Stereotypes Leading LLMs’ Zero-Shot Stance Detection? EMNLP 2025

Safeguarding Vision-Language Models: Mitigating Vulnerabilities to Gaussian Noise in Perturbation-based Attacks ICCV 2025

Towards Adversarial Robustness via Debiased High-Confidence Logit Alignment ICCV 2025

AGrail: A Lifelong Agent Guardrail with Effective and Adaptive Safety Detection ACL 2025

Jailbreaking Multimodal Large Language Models via Shuffle Inconsistency ICCV 2025

Backdoor Mitigation by Distance-Driven Detoxification ICCV 2025

LLM as a Broken Telephone: Iterative Generation Distorts Information ACL 2025

LLMScan: Causal Scan for LLM Misbehavior Detection ICML 2025

NLSR: Neuron-Level Safety Realignment of Large Language Models Against Harmful Fine-Tuning AAAI 2025

LlmFixer: Fix the Helpfulness of Defensive Large Language Models EMNLP 2025

The Staircase of Ethics: Probing LLM Value Priorities through Multi-Step Induction to Complex Moral Dilemmas EMNLP 2025

Certified Mitigation of Worst-Case LLM Copyright Infringement EMNLP 2025

The Confidence Paradox: Can LLM Know When It’s Wrong? IJCNLP 2025

DISTIL: Data-Free Inversion of Suspicious Trojan Inputs via Latent Diffusion ICCV 2025

Biased LLMs can Influence Political Decision-Making ACL 2025

Profiling LLM’s Copyright Infringement Risks under Adversarial Persuasive Prompting EMNLP 2025

SimVBG: Simulating Individual Values by Backstory Generation EMNLP 2025

Learning to Rewrite: Generalized LLM-Generated Text Detection ACL 2025

Neutral Is Not Unbiased: Evaluating Implicit and Intersectional Identity Bias in LLMs Through Structured Narrative Scenarios EMNLP 2025

Self-adaptive Dataset Construction for Real-World Multimodal Safety Scenarios EMNLP 2025

Building Safe GenAI Applications: An End-to-End Overview of Red Teaming for Large Language Models NAACL 2025