Research Explorer
AI Safety
2972 directly classified papers
Papers per year
2002: 1
2006: 1
2007: 1
2012: 4
2013: 1
2015: 5
2016: 1
2017: 13
2018: 40
2019: 91
2020: 111
2021: 181
2022: 204
2023: 333
2024: 642
2025: 1031
2026: 312
Papers
A Robust Test for the Stationarity Assumption in Sequential Decision Making
ICML 2023
Enforcing Hard Constraints with Soft Barriers: Safe Reinforcement Learning in Unknown Stochastic Environments
ICML 2023
Fake the Real: Backdoor Attack on Deep Speech Classification via Voice Conversion
INTERSPEECH 2023
LLMDet: A Third Party Large Language Models Generated Text Detection Tool
EMNLP 2023
Attack Prompt Generation for Red Teaming and Defending Large Language Models
EMNLP 2023
Multi-step Jailbreaking Privacy Attacks on ChatGPT
EMNLP 2023
ToxicChat: Unveiling Hidden Challenges of Toxicity Detection in Real-World User-AI Conversation
EMNLP 2023
A Critical Analysis of Document Out-of-Distribution Detection
EMNLP 2023
Adaptation with Self-Evaluation to Improve Selective Prediction in LLMs
EMNLP 2023
INVITE: a Testbed of Automatically Generated Invalid Questions to Evaluate Large Language Models for Hallucinations
EMNLP 2023
Sparse Black-Box Multimodal Attack for Vision-Language Adversary Generation
EMNLP 2023
ASSERT: Automated Safety Scenario Red Teaming for Evaluating the Robustness of Large Language Models
EMNLP 2023
Towards General Error Diagnosis via Behavioral Testing in Machine Translation
EMNLP 2023
Guiding LLM to Fool Itself: Automatically Manipulating Machine Reading Comprehension Shortcut Triggers
EMNLP 2023
Are Personalized Stochastic Parrots More Dangerous? Evaluating Persona Biases in Dialogue Systems
EMNLP 2023
InstructSafety: A Unified Framework for Building Multidimensional and Explainable Safety Detector through Instruction Tuning
EMNLP 2023
Can ChatGPT Defend its Belief in Truth? Evaluating LLM Reasoning via Debate
EMNLP 2023
LogicAttack: Adversarial Attacks for Evaluating Logical Consistency of Natural Language Inference
EMNLP 2023
Ethical Reasoning over Moral Alignment: A Case and Framework for In-Context Ethical Policies in LLMs
EMNLP 2023
Learning to love diligent trolls: Accounting for rater effects in the dialogue safety task
EMNLP 2023
Adversarial Text Generation by Search and Learning
EMNLP 2023
Large Language Models as SocioTechnical Systems
EMNLP 2023
Identifying and Adapting Transformer-Components Responsible for Gender Bias in an English Language Model
EMNLP 2023
Unveiling Safety Vulnerabilities of Large Language Models
EMNLP 2023
Walking a Tightrope – Evaluating Large Language Models in High-Risk Domains
EMNLP 2023