AI Safety

2972 directly classified papers

Papers

Sandwich attack: Multi-language Mixture Adaptive Attack on LLMs NAACL 2024

Robust Safety Classifier Against Jailbreaking Attacks: Adversarial Prompt Shield NAACL 2024

Subspace Defense: Discarding Adversarial Perturbations by Learning a Subspace for Clean Signals COLING 2024

Simpler Becomes Harder: Do LLMs Exhibit a Coherent Behavior on Simplified Corpora? COLING 2024

Value FULCRA: Mapping Large Language Models to the Multidimensional Spectrum of Basic Human Values NAACL 2024

How does GPT-2 Predict Acronyms? Extracting and Understanding a Circuit via Mechanistic Interpretability AISTATS 2024

Composite Backdoor Attacks Against Large Language Models NAACL 2024

Think While You Write: Hypothesis Verification Promotes Faithful Knowledge-to-Text Generation NAACL 2024

Non-vacuous Generalization Bounds for Adversarial Risk in Stochastic Neural Networks AISTATS 2024

Extracting Prompts by Inverting LLM Outputs EMNLP 2024

Small Agent Can Also Rock! Empowering Small Language Models as Hallucination Detector EMNLP 2024

Holistic Automated Red Teaming for Large Language Models through Top-Down Test Case Generation and Multi-turn Interaction EMNLP 2024

Large Language Models Are Involuntary Truth-Tellers: Exploiting Fallacy Failure for Jailbreak Attacks EMNLP 2024

BEEAR: Embedding-based Adversarial Removal of Safety Backdoors in Instruction-tuned Language Models EMNLP 2024

CleanGen: Mitigating Backdoor Attacks for Generation Tasks in Large Language Models EMNLP 2024

MICo: Preventative Detoxification of Large Language Models through Inhibition Control NAACL 2024

ADEA: An Argumentative Dialogue Dataset on Ethical Issues Concerning Future A.I. Applications COLING 2024

Rethinking Machine Ethics – Can LLMs Perform Moral Reasoning through the Lens of Moral Theories? NAACL 2024

Rainbow Teaming: Open-Ended Generation of Diverse Adversarial Prompts NeurIPS 2024

Vaccine: Perturbation-aware Alignment for Large Language Models against Harmful Fine-tuning Attack NeurIPS 2024

On scalable oversight with weak LLMs judging strong LLMs NeurIPS 2024

Task-Agnostic Detector for Insertion-Based Backdoor Attacks NAACL 2024

k-SemStamp: A Clustering-Based Semantic Watermark for Detection of Machine-Generated Text ACL 2024

Direct Large Language Model Alignment Through Self-Rewarding Contrastive Prompt Distillation ACL 2024

Acquiring Clean Language Models from Backdoor Poisoned Datasets by Downscaling Frequency Space ACL 2024