Research Explorer
AI Safety
2972 directly classified papers
Papers per year
2002: 1
2006: 1
2007: 1
2012: 4
2013: 1
2015: 5
2016: 1
2017: 13
2018: 40
2019: 91
2020: 111
2021: 181
2022: 204
2023: 333
2024: 642
2025: 1031
2026: 312
Papers
Leashing the Inner Demons: Self-Detoxification for Language Models
AAAI 2022
Expose Backdoors on the Way: A Feature-Based Efficient Defense against Textual Backdoor Attacks
EMNLP 2022
Diving Deep into Modes of Fact Hallucinations in Dialogue Systems
EMNLP 2022
Mitigating Covertly Unsafe Text within Natural Language Systems
EMNLP 2022
Constructing Highly Inductive Contexts for Dialogue Safety through Controllable Reverse Generation
EMNLP 2022
SaFeRDialogues: Taking Feedback Gracefully after Conversational Safety Failures
ACL 2022
Contextualizing Language Models for Norms Diverging from Social Majority
EMNLP 2022
Language Model Detoxification in Dialogue with Contextualized Stance Control
EMNLP 2022
Foiling Training-Time Attacks on Neural Machine Translation Systems
EMNLP 2022
Neuro-Symbolic Verification of Deep Neural Networks
IJCAI 2022
Upstream Mitigation Is Not All You Need: Testing the Bias Transfer Hypothesis in Pre-Trained Language Models
ACL 2022
The Moral Integrity Corpus: A Benchmark for Ethical Dialogue Systems
ACL 2022
Robots-Dont-Cry: Understanding Falsely Anthropomorphic Utterances in Dialog Systems
EMNLP 2022
To Trust or Not To Trust Prediction Scores for Membership Inference Attacks
IJCAI 2022
Imperceptible Backdoor Attack: From Input Space to Feature Representation
IJCAI 2022
SafetyKit: First Aid for Measuring Safety in Open-domain Conversational Systems
ACL 2022
On the Safety of Conversational Models: Taxonomy, Dataset, and Benchmark
ACL 2022
Red Teaming Language Models with Language Models
EMNLP 2022
Pipelines for Social Bias Testing of Large Language Models
ACL 2022
Modeling Adversarial Noise for Adversarial Training
ICML 2022
Adversarial Robustness Guarantees for Gaussian Processes
JMLR 2022
Exploring Safer Behaviors for Deep Reinforcement Learning
AAAI 2022
Stability Verification in Stochastic Control Systems via Neural Network Supermartingales
AAAI 2022
Risk-graded Safety for Handling Medical Queries in Conversational AI
IJCNLP 2022
Where to Attack: A Dynamic Locator Model for Backdoor Attack in Text Classifications
COLING 2022