Research Explorer
AI Safety
2972 directly classified papers
Papers per year
2002: 1
2006: 1
2007: 1
2012: 4
2013: 1
2015: 5
2016: 1
2017: 13
2018: 40
2019: 91
2020: 111
2021: 181
2022: 204
2023: 333
2024: 642
2025: 1031
2026: 312
Papers
AGD: Adversarial Game Defense Against Jailbreak Attacks in Large Language Models
ACL 2025
Faithful and Robust LLM-Driven Theorem Proving for NLI Explanations
ACL 2025
Revisit Self-Debugging with Self-Generated Tests for Code Generation
ACL 2025
Exploring LLMs’ Ability to Spontaneously and Conditionally Modify Moral Expressions through Text Manipulation
ACL 2025
Can Indirect Prompt Injection Attacks Be Detected and Removed?
ACL 2025
Defense Against Prompt Injection Attack by Leveraging Attack Techniques
ACL 2025
HAF-RM: A Hybrid Alignment Framework for Reward Model Training
ACL 2025
SConU: Selective Conformal Uncertainty in Large Language Models
ACL 2025
From Outcomes to Processes: Guiding PRM Learning from ORM for Inference-Time Alignment
ACL 2025
LLMs can be easily Confused by Instructional Distractions
ACL 2025
Dynamic Evaluation with Cognitive Reasoning for Multi-turn Safety of Large Language Models
ACL 2025
Prompt-Guided Internal States for Hallucination Detection of Large Language Models
ACL 2025
Towards Fully Exploiting LLM Internal States to Enhance Knowledge Boundary Perception
ACL 2025
LLMs know their vulnerabilities: Uncover Safety Gaps through Natural Distribution Shifts
ACL 2025
Shaping the Safety Boundaries: Understanding and Defending Against Jailbreaks in Large Language Models
ACL 2025
Safety Alignment via Constrained Knowledge Unlearning
ACL 2025
Dehumanizing Machines: Mitigating Anthropomorphic Behaviors in Text Generation Systems
ACL 2025
Exploring the Impact of Instruction-Tuning on LLM’s Susceptibility to Misinformation
ACL 2025
Internal Value Alignment in Large Language Models through Controlled Value Vector Activation
ACL 2025
LLMs Caught in the Crossfire: Malware Requests and Jailbreak Challenges
ACL 2025
Know Your Mistakes: Towards Preventing Overreliance on Task-Oriented Conversational AI Through Accountability Modeling
ACL 2025
SDD: Self-Degraded Defense against Malicious Fine-tuning
ACL 2025
A Theory of Response Sampling in LLMs: Part Descriptive and Part Prescriptive
ACL 2025
MEraser: An Effective Fingerprint Erasure Approach for Large Language Models
ACL 2025
PIGuard: Prompt Injection Guardrail via Mitigating Overdefense for Free
ACL 2025