AI Safety

2972 directly classified papers

Papers

MLLM-Protector: Ensuring MLLM’s Safety without Hurting Performance EMNLP 2024

OmniSafe: An Infrastructure for Accelerating Safe Reinforcement Learning Research JMLR 2024

Intent-Aware and Hate-Mitigating Counterspeech Generation via Dual-Discriminator Guided LLMs COLING 2024

Safety filters for black-box dynamical systems by learning discriminating hyperplanes L4DC 2024

Generalized constraint for probabilistic safe reinforcement learning L4DC 2024

Do no harm: A counterfactual approach to safe reinforcement learning L4DC 2024

Hacking predictors means hacking cars: Using sensitivity analysis to identify trajectory prediction vulnerabilities for autonomous driving security L4DC 2024

From raw data to safety: Reducing conservatism by set expansion L4DC 2024

You don’t need a personality test to know these models are unreliable: Assessing the Reliability of Large Language Models on Psychometric Instruments NAACL 2024

Enhancing Large Language Models Against Inductive Instructions with Dual-critique Prompting NAACL 2024

XSTest: A Test Suite for Identifying Exaggerated Safety Behaviours in Large Language Models NAACL 2024

Attacks, Defenses and Evaluations for LLM Conversation Safety: A Survey NAACL 2024

Towards Probing Speech-Specific Risks in Large Multimodal Models: A Taxonomy, Benchmark, and Insights EMNLP 2024

A System to Detect Forged-Origin BGP Hijacks NSDI 2024

Knowing What LLMs DO NOT Know: A Simple Yet Effective Self-Detection Method NAACL 2024

Pixel-wise Smoothing for Certified Robustness against Camera Motion Perturbations AISTATS 2024

R-Tuning: Instructing Large Language Models to Say ‘I Don’t Know’ NAACL 2024

BaitAttack: Alleviating Intention Shift in Jailbreak Attacks via Adaptive Bait Crafting EMNLP 2024

Advancing the Robustness of Large Language Models through Self-Denoised Smoothing NAACL 2024

Removing RLHF Protections in GPT-4 via Fine-Tuning NAACL 2024

Language Models are Alignable Decision-Makers: Dataset and Application to the Medical Triage Domain NAACL 2024

Citation: A Key to Building Responsible and Accountable Large Language Models NAACL 2024

ReEval: Automatic Hallucination Evaluation for Retrieval-Augmented Large Language Models via Transferable Adversarial Attacks NAACL 2024

Towards Healthy AI: Large Language Models Need Therapists Too NAACL 2024

Cross-Task Defense: Instruction-Tuning LLMs for Content Safety NAACL 2024