AI Safety

2972 directly classified papers

Papers

Language Models Resist Alignment: Evidence From Data Compression ACL 2025

Representation Bending for Large Language Model Safety ACL 2025

SEA: Low-Resource Safety Alignment for Multimodal Large Language Models via Synthetic Embeddings ACL 2025

Mind the Gesture: Evaluating AI Sensitivity to Culturally Offensive Non-Verbal Gestures ACL 2025

Arbiters of Ambivalence: Challenges of using LLMs in No-Consensus tasks ACL 2025

Exploiting Instruction-Following Retrievers for Malicious Information Retrieval ACL 2025

Removing Prompt-template Bias in Reinforcement Learning from Human Feedback ACL 2025

COVER: Context-Driven Over-Refusal Verification in LLMs ACL 2025

Bypassing LLM Guardrails: An Empirical Analysis of Evasion Attacks against Prompt Injection and Jailbreak Detection Systems ACL 2025

Backdoor-Powered Prompt Injection Attacks Nullify Defense Methods EMNLP 2025

Understanding How Value Neurons Shape the Generation of Specified Values in LLMs EMNLP 2025

ToolSafety: A Comprehensive Dataset for Enhancing Safety in LLM-Based Agent Tool Invocations EMNLP 2025

Chain of Attack: On the Robustness of Vision-Language Models Against Transfer-Based Adversarial Attacks CVPR 2025

Sugar-Coated Poison: Benign Generation Unlocks Jailbreaking EMNLP 2025

RevPRAG: Revealing Poisoning Attacks in Retrieval-Augmented Generation through LLM Activation Analysis EMNLP 2025

Inverse Reinforcement Learning Meets Large Language Model Alignment ACL 2025

BadToken: Token-level Backdoor Attacks to Multi-modal Large Language Models CVPR 2025

Proxy Barrier: A Hidden Repeater Layer Defense Against System Prompt Leakage and Jailbreaking EMNLP 2025

MisinfoBench: A Multi-Dimensional Benchmark for Evaluating LLMs’ Resilience to Misinformation EMNLP 2025

Making Every Step Effective: Jailbreaking Large Vision-Language Models Through Hierarchical KV Equalization EMNLP 2025

Unraveling Misinformation Propagation in LLM Reasoning EMNLP 2025

Let Them Down Easy! Contextual Effects of LLM Guardrails on User Perceptions and Preferences EMNLP 2025

Ferret: Faster and Effective Automated Red Teaming with Reward-Based Scoring Technique EMNLP 2025

AMIA: Automatic Masking and Joint Intention Analysis Makes LVLMs Robust Jailbreak Defenders EMNLP 2025

Red-Teaming LLM Multi-Agent Systems via Communication Attacks ACL 2025