conftrace_

Safety

414 papers

Papers per year: 1, 1, 4, 8, 11, 21, 29, 36, 87, 117, 99

Papers

Language Detoxification with Attribute-Discriminative Latent Space ACL 2023

TextVerifier: Robustness Verification for Textual Classifiers with Certifiable Guarantees ACL 2023

Defending against Insertion-based Textual Backdoor Attacks via Attribution ACL 2023

Can Large Language Models Safely Address Patient Questions Following Cataract Surgery? ACL 2023

The Best Defense Is a Good Offense: Adversarial Augmentation Against Adversarial Attacks CVPR 2023

Unveiling the Implicit Toxicity in Large Language Models EMNLP 2023

ToViLaG: Your Visual-Language Generative Model is Also An Evildoer EMNLP 2023

Self-Detoxifying Language Models via Toxification Reversal EMNLP 2023

Ignore This Title and HackAPrompt: Exposing Systemic Vulnerabilities of LLMs Through a Global Prompt Hacking Competition EMNLP 2023

Goodtriever: Adaptive Toxicity Mitigation with Retrieval-augmented Models EMNLP 2023

ASSERT: Automated Safety Scenario Red Teaming for Evaluating the Robustness of Large Language Models EMNLP 2023

Towards Detecting Contextual Real-Time Toxicity for In-Game Chat EMNLP 2023

InstructSafety: A Unified Framework for Building Multidimensional and Explainable Safety Detector through Instruction Tuning EMNLP 2023

GTA: Gated Toxicity Avoidance for LM Performance Preservation EMNLP 2023

Constrained Update Projection Approach to Safe Policy Optimization NIPS 2022

On the Safety of Interpretable Machine Learning: A Maximum Deviation Approach NIPS 2022

Risk-Driven Design of Perception Systems NIPS 2022

Toward Robust Spiking Neural Network Against Adversarial Perturbation NIPS 2022

A Near-Optimal Primal-Dual Method for Off-Policy Learning in CMDP NIPS 2022

Increasing Confidence in Adversarial Robustness Evaluations NIPS 2022

Shield Decentralization for Safe Multi-Agent Reinforcement Learning NIPS 2022

Provable Defense against Backdoor Policies in Reinforcement Learning NIPS 2022

Enhancing Safe Exploration Using Safety State Augmentation NIPS 2022

Counterfactual harm NIPS 2022

Conservative and Adaptive Penalty for Model-Based Safe Reinforcement Learning AAAI 2022