Research Explorer
AI Safety
2972 directly classified papers
Papers per year
2002: 1
2006: 1
2007: 1
2012: 4
2013: 1
2015: 5
2016: 1
2017: 13
2018: 40
2019: 91
2020: 111
2021: 181
2022: 204
2023: 333
2024: 642
2025: 1031
2026: 312
Papers
Graph of Attacks with Pruning: Optimizing Stealthy Jailbreak Prompt Generation for Enhanced LLM Content Moderation
ACL 2025
Red-Teaming for Uncovering Societal Bias in Large Language Models
ACL 2025
Can You Really Trust Code Copilot? Evaluating Large Language Models from a Code Security Perspective
ACL 2025
AGD: Adversarial Game Defense Against Jailbreak Attacks in Large Language Models
ACL 2025
Text-Diffusion Red-Teaming of Large Language Models: Unveiling Harmful Behaviors with Proximity Constraints
AAAI 2025
Faithful and Robust LLM-Driven Theorem Proving for NLI Explanations
ACL 2025
On the Robustness of Distributed Machine Learning Against Transfer Attacks
AAAI 2025
Revisit Self-Debugging with Self-Generated Tests for Code Generation
ACL 2025
Identifying Predictions That Influence the Future: Detecting Performative Concept Drift in Data Streams
AAAI 2025
Exploring LLMs’ Ability to Spontaneously and Conditionally Modify Moral Expressions through Text Manipulation
ACL 2025
Merger-as-a-Stealer: Stealing Targeted PII from Aligned LLMs with Model Merging
EMNLP 2025
Can Indirect Prompt Injection Attacks Be Detected and Removed?
ACL 2025
Pragmatic Inference Chain (PIC) Improving LLMs’ Reasoning of Authentic Implicit Toxic Language
EMNLP 2025
Defense Against Prompt Injection Attack by Leveraging Attack Techniques
ACL 2025
DAMON: A Dialogue-Aware MCTS Framework for Jailbreaking Large Language Models
EMNLP 2025
HAF-RM: A Hybrid Alignment Framework for Reward Model Training
ACL 2025
Influence-Based Fair Selection for Sample-Discriminative Backdoor Attack
AAAI 2025
SConU: Selective Conformal Uncertainty in Large Language Models
ACL 2025
EverTracer: Hunting Stolen Large Language Models via Stealthy and Robust Probabilistic Fingerprint
EMNLP 2025
From Outcomes to Processes: Guiding PRM Learning from ORM for Inference-Time Alignment
ACL 2025
IDEATOR: Jailbreaking and Benchmarking Large Vision-Language Models Using Themselves
ICCV 2025
LLMs can be easily Confused by Instructional Distractions
ACL 2025
Scalable Surrogate Verification of Image-Based Neural Network Control Systems Using Composition and Unrolling
AAAI 2025
Dynamic Evaluation with Cognitive Reasoning for Multi-turn Safety of Large Language Models
ACL 2025
LongSafety: Enhance Safety for Long-Context LLMs
ACL 2025