AI Safety

2972 directly classified papers

Papers

Scaling the Convex Barrier with Sparse Dual Algorithms JMLR 2024

TrojFSP: Trojan Insertion in Few-shot Prompt Tuning NAACL 2024

Ensuring Safe and High-Quality Outputs: A Guideline Library Approach for Language Models NAACL 2024

Look Who’s Talking Now: Covert Channels From Biased LLMs EMNLP 2024

IterAlign: Iterative Constitutional Alignment of Large Language Models NAACL 2024

FLIRT: Feedback Loop In-context Red Teaming EMNLP 2024

SELF-GUARD: Empower the LLM to Safeguard Itself NAACL 2024

MART: Improving LLM Safety with Multi-round Automatic Red-Teaming NAACL 2024

A Wolf in Sheep’s Clothing: Generalized Nested Jailbreak Prompts can Fool Large Language Models Easily NAACL 2024

How Trustworthy are Open-Source LLMs? An Assessment under Malicious Demonstrations Shows their Vulnerabilities NAACL 2024

Mitigating Reward Overoptimization via Lightweight Uncertainty Estimation NeurIPS 2024

AgentDojo: A Dynamic Environment to Evaluate Prompt Injection Attacks and Defenses for LLM Agents NeurIPS 2024

Know Thine Enemy: Adaptive Attacks on Misinformation Detection Using Reinforcement Learning ACL 2024

Reinforcement Learning-Driven LLM Agent for Automated Attacks on LLMs ACL 2024

From Representational Harms to Quality-of-Service Harms: A Case Study on Llama 2 Safety Safeguards ACL 2024

CodeAttack: Revealing Safety Generalization Challenges of Large Language Models via Code Completion ACL 2024

Do Zombies Understand? A Choose-Your-Own-Adventure Exploration of Machine Cognition ACL 2024

LIRE: listwise reward enhancement for preference alignment ACL 2024

The Language Barrier: Dissecting Safety Challenges of LLMs in Multilingual Contexts ACL 2024

One-Shot Safety Alignment for Large Language Models via Optimal Dualization NeurIPS 2024

Bias Detection via Signaling NeurIPS 2024

Instructions as Backdoors: Backdoor Vulnerabilities of Instruction Tuning for Large Language Models NAACL 2024

CONSCENDI: A Contrastive and Scenario-Guided Distillation Approach to Guardrail Models for Virtual Assistants NAACL 2024

Towards a Unified Framework for Adaptable Problematic Content Detection via Continual Learning NAACL 2024

Ethics in Action: Training Reinforcement Learning Agents for Moral Decision-making In Text-based Adventure Games AISTATS 2024