AI Safety

2972 directly classified papers

Papers

Neutral Is Not Unbiased: Evaluating Implicit and Intersectional Identity Bias in LLMs Through Structured Narrative Scenarios EMNLP 2025

Computational Thinking with Computer Vision: Developing AI Competency in an Introductory Computer Science Course AAAI 2025

Prompt-Guided Internal States for Hallucination Detection of Large Language Models ACL 2025

English as Defense Proxy: Mitigating Multilingual Jailbreak via Eliciting English Safety Knowledge EMNLP 2025

Adversary-Aware DPO: Enhancing Safety Alignment in Vision Language Models via Adversarial Training EMNLP 2025

Ferret: Faster and Effective Automated Red Teaming with Reward-Based Scoring Technique EMNLP 2025

Towards Fully Exploiting LLM Internal States to Enhance Knowledge Boundary Perception ACL 2025

DAMAGE: Detecting Adversarially Modified AI Generated Text COLING 2025

Exploring the Vulnerability of the Content Moderation Guardrail in Large Language Models via Intent Manipulation EMNLP 2025

Soft Token Attacks Cannot Reliably Audit Unlearning in Large Language Models EMNLP 2025

LLMs know their vulnerabilities: Uncover Safety Gaps through Natural Distribution Shifts ACL 2025

Making Every Step Effective: Jailbreaking Large Vision-Language Models Through Hierarchical KV Equalization EMNLP 2025

Too Helpful, Too Harmless, Too Honest or Just Right? EMNLP 2025

Let Them Down Easy! Contextual Effects of LLM Guardrails on User Perceptions and Preferences EMNLP 2025

Shaping the Safety Boundaries: Understanding and Defending Against Jailbreaks in Large Language Models ACL 2025

MisinfoBench: A Multi-Dimensional Benchmark for Evaluating LLMs’ Resilience to Misinformation EMNLP 2025

DemonAgent: Dynamically Encrypted Multi-Backdoor Implantation Attack on LLM-based Agent EMNLP 2025

Not All Voices Are Rewarded Equally: Probing and Repairing Reward Models across Human Diversity EMNLP 2025

Safety Alignment via Constrained Knowledge Unlearning ACL 2025

Enhancing the Adversarial Robustness via Manifold Projection AAAI 2025

COMMIT: Certifying Robustness of Multi-Sensor Fusion Systems Against Semantic Attacks AAAI 2025

Unraveling Misinformation Propagation in LLM Reasoning EMNLP 2025

Dehumanizing Machines: Mitigating Anthropomorphic Behaviors in Text Generation Systems ACL 2025

Breaking the Reviewer: Assessing the Vulnerability of Large Language Models in Automated Peer Review Under Textual Adversarial Attacks EMNLP 2025

Tightening Robustness Verification of MaxPool-based Neural Networks via Minimizing the Over-Approximation Zone CVPR 2025