AI Safety

2972 directly classified papers

Papers

Detecting Mode Collapse in Language Models via Narration EACL 2024

How should Conversational Agent systems respond to sexual harassment? EACL 2024

Calibration-Tuning: Teaching Large Language Models to Know What They Don’t Know EACL 2024

Linguistic Obfuscation Attacks and Large Language Model Uncertainty EACL 2024

Challenging Large Language Models with New Tasks: A Study on their Adaptability and Robustness ACL 2024

Large Language Models Relearn Removed Concepts ACL 2024

On the Vulnerability of Safety Alignment in Open-Access LLMs ACL 2024

CR-UTP: Certified Robustness against Universal Text Perturbations on Large Language Models ACL 2024

Boosting LLM Agents with Recursive Contemplation for Effective Deception Handling ACL 2024

SpeechGuard: Exploring the Adversarial Robustness of Multi-modal Large Language Models ACL 2024

Fixing Overconfidence in Dynamic Neural Networks WACV 2024

Natural Light Can Also Be Dangerous: Traffic Sign Misinterpretation Under Adversarial Natural Light Attacks WACV 2024

Exploring Adversarial Robustness of Vision Transformers in the Spectral Perspective WACV 2024

On the Fly Neural Style Smoothing for Risk-Averse Domain Generalization WACV 2024

PsyGUARD: An Automated System for Suicide Detection and Risk Assessment in Psychological Counseling EMNLP 2024

Prompt Leakage effect and mitigation strategies for multi-turn LLM Applications EMNLP 2024

Aligners: Decoupling LLMs and Alignment EMNLP 2024

WalledEval: A Comprehensive Safety Evaluation Toolkit for Large Language Models EMNLP 2024

GPT-4 Jailbreaks Itself with Near-Perfect Success Using Self-Explanation EMNLP 2024

Making Harmful Behaviors Unlearnable for Large Language Models ACL 2024

DORY: Deliberative Prompt Recovery for LLM ACL 2024

Evaluating Robustness of Generative Search Engine on Adversarial Factoid Questions ACL 2024

The Greatest Good Benchmark: Measuring LLMs’ Alignment with Utilitarian Moral Dilemmas EMNLP 2024

Safety Arithmetic: A Framework for Test-time Safety Alignment of Language Models by Steering Parameters and Activations EMNLP 2024

Preference-Guided Reflective Sampling for Aligning Language Models EMNLP 2024