AI Safety

2972 directly classified papers

Papers

Optimal Zero-Shot Detector for Multi-Armed Attacks AISTATS 2024

Large Language Models Must Be Taught to Know What They Don’t Know NIPS 2024

MoGU: A Framework for Enhancing Safety of LLMs While Preserving Their Usability NIPS 2024

Uncovering, Explaining, and Mitigating the Superficial Safety of Backdoor Defense NIPS 2024

POCE: Primal Policy Optimization with Conservative Estimation for Multi-constraint Offline Reinforcement Learning CVPR 2024

Safe-Embed: Unveiling the Safety-Critical Knowledge of Sentence Encoders ACL 2024

POLICEd RL: Learning Closed-Loop Robot Control Policies with Provable Satisfaction of Hard Constraints RSS 2024

DA3: A Distribution-Aware Adversarial Attack against Language Models EMNLP 2024

Optimistic Verifiable Training by Controlling Hardware Nondeterminism NIPS 2024

Real-Time Anomaly Detection and Reactive Planning with Large Language Models RSS 2024

Axioms for AI Alignment from Human Feedback NIPS 2024

Merging AI Incidents Research with Political Misinformation Research: Introducing the Political Deepfakes Incidents Database AAAI 2024

ProMark: Proactive Diffusion Watermarking for Causal Attribution CVPR 2024

MAP: MAsk-Pruning for Source-Free Model Intellectual Property Protection CVPR 2024

Steganographic Passport: An Owner and User Verifiable Credential for Deep Model IP Protection Without Retraining CVPR 2024

BadCLIP: Dual-Embedding Guided Backdoor Attack on Multimodal Contrastive Learning CVPR 2024

Defending Against Alignment-Breaking Attacks via Robustly Aligned LLM ACL 2024

The Dawn After the Dark: An Empirical Study on Factuality Hallucination in Large Language Models ACL 2024

PRP: Propagating Universal Perturbations to Attack Large Language Model Guard-Rails ACL 2024

SoFA: Shielded On-the-fly Alignment via Priority Rule Following ACL 2024

A Comprehensive Study of Jailbreak Attack versus Defense for Large Language Models ACL 2024

UOR: Universal Backdoor Attacks on Pre-trained Language Models ACL 2024

Expectation Alignment: Handling Reward Misspecification in the Presence of Expectation Mismatch NIPS 2024

Interpretable Concept Bottlenecks to Align Reinforcement Learning Agents NIPS 2024

Can I trust You? LLMs as conversational agents EACL 2024