Research Explorer
AI Safety
2972 directly classified papers
Papers per year
2002: 1
2006: 1
2007: 1
2012: 4
2013: 1
2015: 5
2016: 1
2017: 13
2018: 40
2019: 91
2020: 111
2021: 181
2022: 204
2023: 333
2024: 642
2025: 1031
2026: 312
Papers
Neutral Is Not Unbiased: Evaluating Implicit and Intersectional Identity Bias in LLMs Through Structured Narrative Scenarios
EMNLP 2025
Computational Thinking with Computer Vision: Developing AI Competency in an Introductory Computer Science Course
AAAI 2025
Prompt-Guided Internal States for Hallucination Detection of Large Language Models
ACL 2025
English as Defense Proxy: Mitigating Multilingual Jailbreak via Eliciting English Safety Knowledge
EMNLP 2025
Adversary-Aware DPO: Enhancing Safety Alignment in Vision Language Models via Adversarial Training
EMNLP 2025
Ferret: Faster and Effective Automated Red Teaming with Reward-Based Scoring Technique
EMNLP 2025
Towards Fully Exploiting LLM Internal States to Enhance Knowledge Boundary Perception
ACL 2025
DAMAGE: Detecting Adversarially Modified AI Generated Text
COLING 2025
Exploring the Vulnerability of the Content Moderation Guardrail in Large Language Models via Intent Manipulation
EMNLP 2025
Soft Token Attacks Cannot Reliably Audit Unlearning in Large Language Models
EMNLP 2025
LLMs know their vulnerabilities: Uncover Safety Gaps through Natural Distribution Shifts
ACL 2025
Making Every Step Effective: Jailbreaking Large Vision-Language Models Through Hierarchical KV Equalization
EMNLP 2025
Too Helpful, Too Harmless, Too Honest or Just Right?
EMNLP 2025
Let Them Down Easy! Contextual Effects of LLM Guardrails on User Perceptions and Preferences
EMNLP 2025
Shaping the Safety Boundaries: Understanding and Defending Against Jailbreaks in Large Language Models
ACL 2025
MisinfoBench: A Multi-Dimensional Benchmark for Evaluating LLMs’ Resilience to Misinformation
EMNLP 2025
DemonAgent: Dynamically Encrypted Multi-Backdoor Implantation Attack on LLM-based Agent
EMNLP 2025
Not All Voices Are Rewarded Equally: Probing and Repairing Reward Models across Human Diversity
EMNLP 2025
Safety Alignment via Constrained Knowledge Unlearning
ACL 2025
Enhancing the Adversarial Robustness via Manifold Projection
AAAI 2025
COMMIT: Certifying Robustness of Multi-Sensor Fusion Systems Against Semantic Attacks
AAAI 2025
Unraveling Misinformation Propagation in LLM Reasoning
EMNLP 2025
Dehumanizing Machines: Mitigating Anthropomorphic Behaviors in Text Generation Systems
ACL 2025
Breaking the Reviewer: Assessing the Vulnerability of Large Language Models in Automated Peer Review Under Textual Adversarial Attacks
EMNLP 2025
Tightening Robustness Verification of MaxPool-based Neural Networks via Minimizing the Over-Approximation Zone
CVPR 2025