AI Safety

2972 directly classified papers

Papers

Leashing the Inner Demons: Self-Detoxification for Language Models AAAI 2022

Expose Backdoors on the Way: A Feature-Based Efficient Defense against Textual Backdoor Attacks EMNLP 2022

Diving Deep into Modes of Fact Hallucinations in Dialogue Systems EMNLP 2022

Mitigating Covertly Unsafe Text within Natural Language Systems EMNLP 2022

Constructing Highly Inductive Contexts for Dialogue Safety through Controllable Reverse Generation EMNLP 2022

SaFeRDialogues: Taking Feedback Gracefully after Conversational Safety Failures ACL 2022

Contextualizing Language Models for Norms Diverging from Social Majority EMNLP 2022

Language Model Detoxification in Dialogue with Contextualized Stance Control EMNLP 2022

Foiling Training-Time Attacks on Neural Machine Translation Systems EMNLP 2022

Neuro-Symbolic Verification of Deep Neural Networks IJCAI 2022

Upstream Mitigation Is Not All You Need: Testing the Bias Transfer Hypothesis in Pre-Trained Language Models ACL 2022

The Moral Integrity Corpus: A Benchmark for Ethical Dialogue Systems ACL 2022

Robots-Dont-Cry: Understanding Falsely Anthropomorphic Utterances in Dialog Systems EMNLP 2022

To Trust or Not To Trust Prediction Scores for Membership Inference Attacks IJCAI 2022

Imperceptible Backdoor Attack: From Input Space to Feature Representation IJCAI 2022

SafetyKit: First Aid for Measuring Safety in Open-domain Conversational Systems ACL 2022

On the Safety of Conversational Models: Taxonomy, Dataset, and Benchmark ACL 2022

Red Teaming Language Models with Language Models EMNLP 2022

Pipelines for Social Bias Testing of Large Language Models ACL 2022

Modeling Adversarial Noise for Adversarial Training ICML 2022

Adversarial Robustness Guarantees for Gaussian Processes JMLR 2022

Exploring Safer Behaviors for Deep Reinforcement Learning AAAI 2022

Stability Verification in Stochastic Control Systems via Neural Network Supermartingales AAAI 2022

Risk-graded Safety for Handling Medical Queries in Conversational AI IJCNLP 2022

Where to Attack: A Dynamic Locator Model for Backdoor Attack in Text Classifications COLING 2022