AI Safety

2972 directly classified papers

Papers

Neural Policy Safety Verification via Predicate Abstraction: CEGAR AAAI 2023

Understanding and Enhancing Robustness of Concept-Based Models AAAI 2023

WAT: Improve the Worst-Class Robustness in Adversarial Training AAAI 2023

Quantization-Aware Interval Bound Propagation for Training Certifiably Robust Quantized Neural Networks AAAI 2023

A Semidefinite Relaxation Based Branch-and-Bound Method for Tight Neural Network Verification AAAI 2023

Out-of-Distribution Detection Is Not All You Need AAAI 2023

PatchNAS: Repairing DNNs in Deployment with Patched Network Architecture Search AAAI 2023

Jailbroken: How Does LLM Safety Training Fail? NIPS 2023

Language Models Don't Always Say What They Think: Unfaithful Explanations in Chain-of-Thought Prompting NIPS 2023

Setting the Trap: Capturing and Defeating Backdoors in Pretrained Language Models through Honeypots NIPS 2023

TrojLLM: A Black-box Trojan Prompt Attack on Large Language Models NIPS 2023

On the Exploitability of Instruction Tuning NIPS 2023

CARE-MI: Chinese Benchmark for Misinformation Evaluation in Maternity and Infant Care NIPS 2023

Collaborative Alignment of NLP Models NIPS 2023

DecodingTrust: A Comprehensive Assessment of Trustworthiness in GPT Models NIPS 2023

Paraphrasing evades detectors of AI-generated text, but retrieval is an effective defense NIPS 2023

A Pseudo-Semantic Loss for Autoregressive Models with Logical Constraints NIPS 2023

Iterative Reachability Estimation for Safe Reinforcement Learning NIPS 2023

Seeing is not Believing: Robust Reinforcement Learning against Spurious Correlation NIPS 2023

Survival Instinct in Offline Reinforcement Learning NIPS 2023

Provably Safe Reinforcement Learning with Step-wise Violation Constraints NIPS 2023

Sample-Efficient and Safe Deep Reinforcement Learning via Reset Deep Ensemble Agents NIPS 2023

Behavior Alignment via Reward Function Optimization NIPS 2023

BIRD: Generalizable Backdoor Detection and Removal for Deep Reinforcement Learning NIPS 2023

Corruption-Robust Offline Reinforcement Learning with General Function Approximation NIPS 2023