Research Explorer
AI Safety
2972 directly classified papers
Papers per year
2002: 1
2006: 1
2007: 1
2012: 4
2013: 1
2015: 5
2016: 1
2017: 13
2018: 40
2019: 91
2020: 111
2021: 181
2022: 204
2023: 333
2024: 642
2025: 1031
2026: 312
Papers
From Outcomes to Processes: Guiding PRM Learning from ORM for Inference-Time Alignment
ACL 2025
SConU: Selective Conformal Uncertainty in Large Language Models
ACL 2025
X-Guard: Multilingual Guard Agent for Content Moderation
ACL 2025
HAF-RM: A Hybrid Alignment Framework for Reward Model Training
ACL 2025
Defense Against Prompt Injection Attack by Leveraging Attack Techniques
ACL 2025
PROTECT: Policy-Related Organizational Value Taxonomy for Ethical Compliance and Trust
ACL 2025
LLM Agents Can Be Choice-Supportive Biased Evaluators: An Empirical Study
AAAI 2025
Can Indirect Prompt Injection Attacks Be Detected and Removed?
ACL 2025
Exploring LLMs’ Ability to Spontaneously and Conditionally Modify Moral Expressions through Text Manipulation
ACL 2025
Measuring Human and AI Values Based on Generative Psychometrics with Large Language Models
AAAI 2025
Revisit Self-Debugging with Self-Generated Tests for Code Generation
ACL 2025
Faithful and Robust LLM-Driven Theorem Proving for NLI Explanations
ACL 2025
iShumei-Chinchunmei at SemEval-2025 Task 4: A balanced forgetting and retention multi-task framework using effective unlearning loss
ACL 2025
CL-Attack: Textual Backdoor Attacks via Cross-Lingual Triggers
AAAI 2025
SafeInfer: Context Adaptive Decoding Time Safety Alignment for Large Language Models
AAAI 2025
AGD: Adversarial Game Defense Against Jailbreak Attacks in Large Language Models
ACL 2025
Can You Really Trust Code Copilot? Evaluating Large Language Models from a Code Security Perspective
ACL 2025
DiveR-CT: Diversity-enhanced Red Teaming Large Language Model Assistants with Relaxing Constraints
AAAI 2025
Do not Abstain! Identify and Solve the Uncertainty
ACL 2025
Sheep’s Skin, Wolf’s Deeds: Are LLMs Ready for Metaphorical Implicit Hate Speech?
ACL 2025
Look Before You Leap: Enhance Attention and Vigilance Regarding Harmful Content with GuidelineLLM
AAAI 2025
Watch Out for Your Guidance on Generation! Exploring Conditional Backdoor Attacks against Large Language Models
AAAI 2025
M2S: Multi-turn to Single-turn jailbreak in Red Teaming for LLMs
ACL 2025
How to Mitigate Overfitting in Weak-to-strong Generalization?
ACL 2025
NLSR: Neuron-Level Safety Realignment of Large Language Models Against Harmful Fine-Tuning
AAAI 2025