Research Explorer
AI Safety
2972 directly classified papers
Papers per year
2002: 1
2006: 1
2007: 1
2012: 4
2013: 1
2015: 5
2016: 1
2017: 13
2018: 40
2019: 91
2020: 111
2021: 181
2022: 204
2023: 333
2024: 642
2025: 1031
2026: 312
Papers
Language Models Resist Alignment: Evidence From Data Compression
ACL 2025
Representation Bending for Large Language Model Safety
ACL 2025
SEA: Low-Resource Safety Alignment for Multimodal Large Language Models via Synthetic Embeddings
ACL 2025
Mind the Gesture: Evaluating AI Sensitivity to Culturally Offensive Non-Verbal Gestures
ACL 2025
Arbiters of Ambivalence: Challenges of using LLMs in No-Consensus tasks
ACL 2025
Exploiting Instruction-Following Retrievers for Malicious Information Retrieval
ACL 2025
Removing Prompt-template Bias in Reinforcement Learning from Human Feedback
ACL 2025
COVER: Context-Driven Over-Refusal Verification in LLMs
ACL 2025
Bypassing LLM Guardrails: An Empirical Analysis of Evasion Attacks against Prompt Injection and Jailbreak Detection Systems
ACL 2025
Backdoor-Powered Prompt Injection Attacks Nullify Defense Methods
EMNLP 2025
Understanding How Value Neurons Shape the Generation of Specified Values in LLMs
EMNLP 2025
ToolSafety: A Comprehensive Dataset for Enhancing Safety in LLM-Based Agent Tool Invocations
EMNLP 2025
Chain of Attack: On the Robustness of Vision-Language Models Against Transfer-Based Adversarial Attacks
CVPR 2025
Sugar-Coated Poison: Benign Generation Unlocks Jailbreaking
EMNLP 2025
RevPRAG: Revealing Poisoning Attacks in Retrieval-Augmented Generation through LLM Activation Analysis
EMNLP 2025
Inverse Reinforcement Learning Meets Large Language Model Alignment
ACL 2025
BadToken: Token-level Backdoor Attacks to Multi-modal Large Language Models
CVPR 2025
Proxy Barrier: A Hidden Repeater Layer Defense Against System Prompt Leakage and Jailbreaking
EMNLP 2025
MisinfoBench: A Multi-Dimensional Benchmark for Evaluating LLMs’ Resilience to Misinformation
EMNLP 2025
Making Every Step Effective: Jailbreaking Large Vision-Language Models Through Hierarchical KV Equalization
EMNLP 2025
Unraveling Misinformation Propagation in LLM Reasoning
EMNLP 2025
Let Them Down Easy! Contextual Effects of LLM Guardrails on User Perceptions and Preferences
EMNLP 2025
Ferret: Faster and Effective Automated Red Teaming with Reward-Based Scoring Technique
EMNLP 2025
AMIA: Automatic Masking and Joint Intention Analysis Makes LVLMs Robust Jailbreak Defenders
EMNLP 2025
Red-Teaming LLM Multi-Agent Systems via Communication Attacks
ACL 2025