Research Explorer
AI Safety
2972 directly classified papers
Papers per year
2002: 1
2006: 1
2007: 1
2012: 4
2013: 1
2015: 5
2016: 1
2017: 13
2018: 40
2019: 91
2020: 111
2021: 181
2022: 204
2023: 333
2024: 642
2025: 1031
2026: 312
Papers
Towards Safe Concept Transfer of Multi-Modal Diffusion via Causal Representation Editing
NIPS 2024
Learning Safety Constraints from Demonstrations with Unknown Rewards
AISTATS 2024
Formal Verification of Unknown Stochastic Systems via Non-parametric Estimation
AISTATS 2024
WildGuard: Open One-stop Moderation Tools for Safety Risks, Jailbreaks, and Refusals of LLMs
NIPS 2024
Sampling-based Safe Reinforcement Learning for Nonlinear Dynamical Systems
AISTATS 2024
Clever Hans or Neural Theory of Mind? Stress Testing Social Reasoning in Large Language Models
EACL 2024
Reinforcement Learning with Adaptive Regularization for Safe Control of Critical Systems
NIPS 2024
SafeSora: Towards Safety Alignment of Text2Video Generation via a Human Preference Dataset
NIPS 2024
Conformal Alignment: Knowing When to Trust Foundation Models with Guarantees
NIPS 2024
Direct Unlearning Optimization for Robust and Safe Text-to-Image Models
NIPS 2024
Theoretical Investigations and Practical Enhancements on Tail Task Risk Minimization in Meta Learning
NIPS 2024
A Theoretical Understanding of Self-Correction through In-context Alignment
NIPS 2024
Post-Hoc Reversal: Are We Selecting Models Prematurely?
NIPS 2024
Do-Not-Answer: Evaluating Safeguards in LLMs
EACL 2024
Automated Adversarial Discovery for Safety Classifiers
NAACL 2024
Random Smooth-based Certified Defense against Text Adversarial Attack
EACL 2024
Gradient-Based Language Model Red Teaming
EACL 2024
Advancing Beyond Identification: Multi-bit Watermark for Large Language Models
NAACL 2024
Universal Prompt Optimizer for Safe Text-to-Image Generation
NAACL 2024
Navigating the Safety Landscape: Measuring Risks in Finetuning Large Language Models
NIPS 2024
Diffusion Models are Certifiably Robust Classifiers
NIPS 2024
A Comprehensive Survey of Hallucination in Large Language, Image, Video and Audio Foundation Models
EMNLP 2024
Backdoor NLP Models via AI-Generated Text
COLING 2024
Detection, Diagnosis, and Explanation: A Benchmark for Chinese Medical Hallucination Evaluation
COLING 2024
Linguistic Rule Induction Improves Adversarial and OOD Robustness in Large Language Models
COLING 2024