Research Explorer
AI Safety
2972 directly classified papers
Papers per year
2002: 1
2006: 1
2007: 1
2012: 4
2013: 1
2015: 5
2016: 1
2017: 13
2018: 40
2019: 91
2020: 111
2021: 181
2022: 204
2023: 333
2024: 642
2025: 1031
2026: 312
Papers
Formal Synthesis of Safe Kolmogorov-Arnold Network Controllers with Barrier Certificates
IJCAI 2025
PIGuard: Prompt Injection Guardrail via Mitigating Overdefense for Free
ACL 2025
ProcessBench: Identifying Process Errors in Mathematical Reasoning
ACL 2025
LSSF: Safety Alignment for Large Language Models through Low-Rank Safety Subspace Fusion
ACL 2025
Neuron Similarity-Based Neural Network Verification via Abstraction and Refinement
IJCAI 2025
LongSafety: Evaluating Long-Context Safety of Large Language Models
ACL 2025
LED-Merging: Mitigating Safety-Utility Conflicts in Model Merging with Location-Election-Disjoint
ACL 2025
Unintended Harms of Value-Aligned LLMs: Psychological and Empirical Insights
ACL 2025
Model Rake: A Defense Against Stealing Attacks in Split Learning
IJCAI 2025
Adversarial Preference Learning for Robust LLM Alignment
ACL 2025
Mind the Gesture: Evaluating AI Sensitivity to Culturally Offensive Non-Verbal Gestures
ACL 2025
Are the Values of LLMs Structurally Aligned with Humans? A Causal Perspective
ACL 2025
COVER: Context-Driven Over-Refusal Verification in LLMs
ACL 2025
Extended Abstract: Probing-Guided Parameter-Efficient Fine-Tuning for Balancing Linguistic Adaptation and Safety in LLM-based Social Influence Systems
ACL 2025
Proxy Barrier: A Hidden Repeater Layer Defense Against System Prompt Leakage and Jailbreaking
EMNLP 2025
Detoxify-IT: An Italian Parallel Dataset for Text Detoxification
ACL 2025
Unraveling Misinformation Propagation in LLM Reasoning
EMNLP 2025
Multilingual Text-to-Image Generation Magnifies Gender Stereotypes
ACL 2025
Ferret: Faster and Effective Automated Red Teaming with Reward-Based Scoring Technique
EMNLP 2025
Caution for the Environment: Multimodal LLM Agents are Susceptible to Environmental Distractions
ACL 2025
Backdoor Attack on Propagation-based Rumor Detectors
AAAI 2025
Can GPTZero’s AI Vocabulary Distinguish Between LLM-Generated and Student-Written Essays?
ACL 2025
Can You Trick the Grader? Adversarial Persuasion of LLM Judges
EMNLP 2025
Efficient but Vulnerable: Benchmarking and Defending LLM Batch Prompting Attack
ACL 2025
LongSafety: Enhance Safety for Long-Context LLMs
ACL 2025