AI Safety

2972 directly classified papers

Papers

A Robust Test for the Stationarity Assumption in Sequential Decision Making ICML 2023

Enforcing Hard Constraints with Soft Barriers: Safe Reinforcement Learning in Unknown Stochastic Environments ICML 2023

Fake the Real: Backdoor Attack on Deep Speech Classification via Voice Conversion INTERSPEECH 2023

LLMDet: A Third Party Large Language Models Generated Text Detection Tool EMNLP 2023

Attack Prompt Generation for Red Teaming and Defending Large Language Models EMNLP 2023

Multi-step Jailbreaking Privacy Attacks on ChatGPT EMNLP 2023

ToxicChat: Unveiling Hidden Challenges of Toxicity Detection in Real-World User-AI Conversation EMNLP 2023

A Critical Analysis of Document Out-of-Distribution Detection EMNLP 2023

Adaptation with Self-Evaluation to Improve Selective Prediction in LLMs EMNLP 2023

INVITE: a Testbed of Automatically Generated Invalid Questions to Evaluate Large Language Models for Hallucinations EMNLP 2023

Sparse Black-Box Multimodal Attack for Vision-Language Adversary Generation EMNLP 2023

ASSERT: Automated Safety Scenario Red Teaming for Evaluating the Robustness of Large Language Models EMNLP 2023

Towards General Error Diagnosis via Behavioral Testing in Machine Translation EMNLP 2023

Guiding LLM to Fool Itself: Automatically Manipulating Machine Reading Comprehension Shortcut Triggers EMNLP 2023

Are Personalized Stochastic Parrots More Dangerous? Evaluating Persona Biases in Dialogue Systems EMNLP 2023

InstructSafety: A Unified Framework for Building Multidimensional and Explainable Safety Detector through Instruction Tuning EMNLP 2023

Can ChatGPT Defend its Belief in Truth? Evaluating LLM Reasoning via Debate EMNLP 2023

LogicAttack: Adversarial Attacks for Evaluating Logical Consistency of Natural Language Inference EMNLP 2023

Ethical Reasoning over Moral Alignment: A Case and Framework for In-Context Ethical Policies in LLMs EMNLP 2023

Learning to love diligent trolls: Accounting for rater effects in the dialogue safety task EMNLP 2023

Adversarial Text Generation by Search and Learning EMNLP 2023

Large Language Models as SocioTechnical Systems EMNLP 2023

Identifying and Adapting Transformer-Components Responsible for Gender Bias in an English Language Model EMNLP 2023

Unveiling Safety Vulnerabilities of Large Language Models EMNLP 2023

Walking a Tightrope – Evaluating Large Language Models in High-Risk Domains EMNLP 2023