AI Safety

2972 directly classified papers

Papers

Self-Detoxifying Language Models via Toxification Reversal EMNLP 2023

Can We Edit Factual Knowledge by In-Context Learning? EMNLP 2023

Ignore This Title and HackAPrompt: Exposing Systemic Vulnerabilities of LLMs Through a Global Prompt Hacking Competition EMNLP 2023

Speak, Memory: An Archaeology of Books Known to ChatGPT/GPT-4 EMNLP 2023

Copyright Violations and Large Language Models EMNLP 2023

SelfCheckGPT: Zero-Resource Black-Box Hallucination Detection for Generative Large Language Models EMNLP 2023

StereoMap: Quantifying the Awareness of Human-like Stereotypes in Large Language Models EMNLP 2023

Prompt as Triggers for Backdoor Attack: Examining the Vulnerability in Language Models EMNLP 2023

Hallucination Detection for Generative Large Language Models by Bayesian Sequential Estimation EMNLP 2023

Security Challenges in Natural Language Processing Models EMNLP 2023

Mitigating Societal Harms in Large Language Models EMNLP 2023

LM-Polygraph: Uncertainty Estimation for Language Models EMNLP 2023

AART: AI-Assisted Red-Teaming with Diverse Data Generation for New LLM-powered Applications EMNLP 2023

Self-Criticism: Aligning Large Language Models with their Understanding of Helpfulness, Honesty, and Harmlessness EMNLP 2023

Toxicity in ChatGPT: Analyzing Persona-assigned Language Models EMNLP 2023

Mitigating Backdoor Poisoning Attacks through the Lens of Spurious Correlation EMNLP 2023

Toward Stronger Textual Attack Detectors EMNLP 2023

Watermarking LLMs with Weight Quantization EMNLP 2023

Towards Safe Reinforcement Learning via Constraining Conditional Value-at-Risk IJCAI 2022

Textual Manifold-based Defense Against Natural Language Adversarial Examples EMNLP 2022

Are You Stealing My Model? Sample Correlation for Fingerprinting Deep Neural Networks NeurIPS 2022

Training with More Confidence: Mitigating Injected and Natural Backdoors During Training NeurIPS 2022

Randomized Message-Interception Smoothing: Gray-box Certificates for Graph Neural Networks NeurIPS 2022

Provably Adversarially Robust Detection of Out-of-Distribution Data (Almost) for Free NeurIPS 2022

MORA: Improving Ensemble Robustness Evaluation with Model Reweighing Attack NeurIPS 2022