Research Explorer
AI Safety
2972 directly classified papers
Papers per year
2002: 1
2006: 1
2007: 1
2012: 4
2013: 1
2015: 5
2016: 1
2017: 13
2018: 40
2019: 91
2020: 111
2021: 181
2022: 204
2023: 333
2024: 642
2025: 1031
2026: 312
Papers
Detecting Mode Collapse in Language Models via Narration
EACL 2024
How should Conversational Agent systems respond to sexual harassment?
EACL 2024
Calibration-Tuning: Teaching Large Language Models to Know What They Don’t Know
EACL 2024
Linguistic Obfuscation Attacks and Large Language Model Uncertainty
EACL 2024
Challenging Large Language Models with New Tasks: A Study on their Adaptability and Robustness
ACL 2024
Large Language Models Relearn Removed Concepts
ACL 2024
On the Vulnerability of Safety Alignment in Open-Access LLMs
ACL 2024
CR-UTP: Certified Robustness against Universal Text Perturbations on Large Language Models
ACL 2024
Boosting LLM Agents with Recursive Contemplation for Effective Deception Handling
ACL 2024
SpeechGuard: Exploring the Adversarial Robustness of Multi-modal Large Language Models
ACL 2024
Fixing Overconfidence in Dynamic Neural Networks
WACV 2024
Natural Light Can Also Be Dangerous: Traffic Sign Misinterpretation Under Adversarial Natural Light Attacks
WACV 2024
Exploring Adversarial Robustness of Vision Transformers in the Spectral Perspective
WACV 2024
On the Fly Neural Style Smoothing for Risk-Averse Domain Generalization
WACV 2024
PsyGUARD: An Automated System for Suicide Detection and Risk Assessment in Psychological Counseling
EMNLP 2024
Prompt Leakage effect and mitigation strategies for multi-turn LLM Applications
EMNLP 2024
Aligners: Decoupling LLMs and Alignment
EMNLP 2024
WalledEval: A Comprehensive Safety Evaluation Toolkit for Large Language Models
EMNLP 2024
GPT-4 Jailbreaks Itself with Near-Perfect Success Using Self-Explanation
EMNLP 2024
Making Harmful Behaviors Unlearnable for Large Language Models
ACL 2024
DORY: Deliberative Prompt Recovery for LLM
ACL 2024
Evaluating Robustness of Generative Search Engine on Adversarial Factoid Questions
ACL 2024
The Greatest Good Benchmark: Measuring LLMs’ Alignment with Utilitarian Moral Dilemmas
EMNLP 2024
Safety Arithmetic: A Framework for Test-time Safety Alignment of Language Models by Steering Parameters and Activations
EMNLP 2024
Preference-Guided Reflective Sampling for Aligning Language Models
EMNLP 2024