Research Explorer
Papers
Trends
Conferences
Explore
Authors
Topics
Keywords
Achievements
About
Methodology
AI Safety
2972 directly classified papers
Papers per year
2002: 1
2006: 1
2007: 1
2012: 4
2013: 1
2015: 5
2016: 1
2017: 13
2018: 40
2019: 91
2020: 111
2021: 181
2022: 204
2023: 333
2024: 642
2025: 1031
2026: 312
Papers
Know Your Mistakes: Towards Preventing Overreliance on Task-Oriented Conversational AI Through Accountability Modeling
ACL 2025
LLMs Caught in the Crossfire: Malware Requests and Jailbreak Challenges
ACL 2025
Task-Agnostic Language Model Watermarking via High Entropy Passthrough Layers
AAAI 2025
Internal Value Alignment in Large Language Models through Controlled Value Vector Activation
ACL 2025
Exploring the Impact of Instruction-Tuning on LLM’s Susceptibility to Misinformation
ACL 2025
AUTE: Peer-Alignment and Self-Unlearning Boost Adversarial Robustness for Training Ensemble Models
AAAI 2025
Tuning-Free Accountable Intervention for LLM Deployment – a Metacognitive Approach
AAAI 2025
Mitigating Social Bias in Large Language Models: A Multi-Objective Approach Within a Multi-Agent Framework
AAAI 2025
Dehumanizing Machines: Mitigating Anthropomorphic Behaviors in Text Generation Systems
ACL 2025
Safety Alignment via Constrained Knowledge Unlearning
ACL 2025
Contrasting Adversarial Perturbations: The Space of Harmless Perturbations
AAAI 2025
Shaping the Safety Boundaries: Understanding and Defending Against Jailbreaks in Large Language Models
ACL 2025
LLMs know their vulnerabilities: Uncover Safety Gaps through Natural Distribution Shifts
ACL 2025
GIM: A Million-scale Benchmark for Generative Image Manipulation Detection and Localization
AAAI 2025
FigStep: Jailbreaking Large Vision-Language Models via Typographic Visual Prompts
AAAI 2025
Towards Fully Exploiting LLM Internal States to Enhance Knowledge Boundary Perception
ACL 2025
Prompt-Guided Internal States for Hallucination Detection of Large Language Models
ACL 2025
Measuring Error Alignment for Decision-Making Systems
AAAI 2025
Dynamic Evaluation with Cognitive Reasoning for Multi-turn Safety of Large Language Models
ACL 2025
LLMs can be easily Confused by Instructional Distractions
ACL 2025
PROSAC: Provably Safe Certification for Machine Learning Models under Adversarial Attacks
AAAI 2025
All You Need Is S P A C E: When Jailbreaking Meets Bias Audit and Reveals What Lies Beneath the Guardrails (Student Abstract)
AAAI 2025
Backdoor Token Unlearning: Exposing and Defending Backdoors in Pretrained Language Models
AAAI 2025
NLSR: Neuron-Level Safety Realignment of Large Language Models Against Harmful Fine-Tuning
AAAI 2025
LLM Agents Can Be Choice-Supportive Biased Evaluators: An Empirical Study
AAAI 2025