Research Explorer
AI Safety
2972 directly classified papers
Papers per year
2002: 1
2006: 1
2007: 1
2012: 4
2013: 1
2015: 5
2016: 1
2017: 13
2018: 40
2019: 91
2020: 111
2021: 181
2022: 204
2023: 333
2024: 642
2025: 1031
2026: 312
Papers
Prompt-Guided Internal States for Hallucination Detection of Large Language Models
ACL 2025
Crossfire: An Elastic Defense Framework for Graph Neural Networks Under Bit Flip Attacks
AAAI 2025
Towards Fully Exploiting LLM Internal States to Enhance Knowledge Boundary Perception
ACL 2025
Backdoor Attack on Propagation-based Rumor Detectors
AAAI 2025
LLMs know their vulnerabilities: Uncover Safety Gaps through Natural Distribution Shifts
ACL 2025
Towards Computational Foreseeability
AAAI 2025
Shaping the Safety Boundaries: Understanding and Defending Against Jailbreaks in Large Language Models
ACL 2025
Grimm: A Plug-and-Play Perturbation Rectifier for Graph Neural Networks Defending Against Poisoning Attacks
AAAI 2025
Safety Alignment via Constrained Knowledge Unlearning
ACL 2025
Probabilistic Shielding for Safe Reinforcement Learning
AAAI 2025
Dehumanizing Machines: Mitigating Anthropomorphic Behaviors in Text Generation Systems
ACL 2025
Tilted Quantile Gradient Updates for Quantile-Constrained Reinforcement Learning
AAAI 2025
Exploring the Impact of Instruction-Tuning on LLM’s Susceptibility to Misinformation
ACL 2025
Stealthy Jailbreak Attacks on Large Language Models via Benign Data Mirroring
NAACL 2025
Internal Value Alignment in Large Language Models through Controlled Value Vector Activation
ACL 2025
Protecting Model Adaptation from Trojans in the Unlabeled Data
AAAI 2025
LLMs Caught in the Crossfire: Malware Requests and Jailbreak Challenges
ACL 2025
Extracting and Understanding the Superficial Knowledge in Alignment
NAACL 2025
Know Your Mistakes: Towards Preventing Overreliance on Task-Oriented Conversational AI Through Accountability Modeling
ACL 2025
COMMIT: Certifying Robustness of Multi-Sensor Fusion Systems Against Semantic Attacks
AAAI 2025
SDD: Self-Degraded Defense against Malicious Fine-tuning
ACL 2025
SEAL: Systematic Error Analysis for Value ALignment
AAAI 2025
A Theory of Response Sampling in LLMs: Part Descriptive and Part Prescriptive
ACL 2025
Quantitative Predictive Monitoring and Control for Safe Human-Machine Interaction
AAAI 2025
LongSafety: Enhance Safety for Long-Context LLMs
ACL 2025