AI Safety

2972 directly classified papers

Papers

Reconsidering Deception in Social Robotics: The Role of Human Vulnerability (Student Abstract) AAAI 2023

Robust Multi-bit Natural Language Watermarking through Invariant Features ACL 2023

Explanation-based Finetuning Makes Models More Robust to Spurious Cues ACL 2023

Are You Copying My Model? Protecting the Copyright of Large Language Models for EaaS via Backdoor Watermark ACL 2023

Targeted Data Generation: Finding and Fixing Model Weaknesses ACL 2023

BITE: Textual Backdoor Attacks with Iterative Trigger Injection ACL 2023

Nichelle and Nancy: The Influence of Demographic Attributes and Tokenization Length on First Name Biases ACL 2023

Controlling the Extraction of Memorized Data from Large Language Models via Prompt-Tuning ACL 2023

A Gradient Control Method for Backdoor Attacks on Parameter-Efficient Tuning ACL 2023

Don’t Retrain, Just Rewrite: Countering Adversarial Perturbations by Rewriting Text ACL 2023

Maximum Entropy Loss, the Silver Bullet Targeting Backdoor Attacks in Pre-trained Language Models ACL 2023

Foveate, Attribute, and Rationalize: Towards Physically Safe and Trustworthy AI ACL 2023

Reward Gaming in Conditional Text Generation ACL 2023

ClarifyDelphi: Reinforced Clarification Questions with Defeasibility Rewards for Social and Moral Situations ACL 2023

Certified Robustness via Dynamic Margin Maximization and Improved Lipschitz Regularization NeurIPS 2023

Reliability Check: An Analysis of GPT-3’s Response to Sensitive Topics and Prompt Wording ACL 2023

Adversarial Textual Robustness on Visual Dialog ACL 2023

Logic-driven Indirect Supervision: An Application to Crisis Counseling ACL 2023

FORK: A Bite-Sized Test Set for Probing Culinary Cultural Biases in Commonsense Reasoning Models ACL 2023

Diffusion Theory as a Scalpel: Detecting and Purifying Poisonous Dimensions in Pre-trained Language Models Caused by Backdoor or Bias ACL 2023

Can Large Language Models Safely Address Patient Questions Following Cataract Surgery? ACL 2023

Unveiling the Implicit Toxicity in Large Language Models EMNLP 2023

The Past, Present and Better Future of Feedback Learning in Large Language Models for Subjective Human Preferences and Values EMNLP 2023

The Troubling Emergence of Hallucination in Large Language Models - An Extensive Definition, Quantification, and Prescriptive Remediations EMNLP 2023

TrojanSQL: SQL Injection against Natural Language Interface to Database EMNLP 2023