Research Explorer
AI Safety
2972 directly classified papers
Papers per year
2002: 1
2006: 1
2007: 1
2012: 4
2013: 1
2015: 5
2016: 1
2017: 13
2018: 40
2019: 91
2020: 111
2021: 181
2022: 204
2023: 333
2024: 642
2025: 1031
2026: 312
Papers
Self-Detoxifying Language Models via Toxification Reversal
EMNLP 2023
Can We Edit Factual Knowledge by In-Context Learning?
EMNLP 2023
Ignore This Title and HackAPrompt: Exposing Systemic Vulnerabilities of LLMs Through a Global Prompt Hacking Competition
EMNLP 2023
Speak, Memory: An Archaeology of Books Known to ChatGPT/GPT-4
EMNLP 2023
Copyright Violations and Large Language Models
EMNLP 2023
SelfCheckGPT: Zero-Resource Black-Box Hallucination Detection for Generative Large Language Models
EMNLP 2023
StereoMap: Quantifying the Awareness of Human-like Stereotypes in Large Language Models
EMNLP 2023
Prompt as Triggers for Backdoor Attack: Examining the Vulnerability in Language Models
EMNLP 2023
Hallucination Detection for Generative Large Language Models by Bayesian Sequential Estimation
EMNLP 2023
Security Challenges in Natural Language Processing Models
EMNLP 2023
Mitigating Societal Harms in Large Language Models
EMNLP 2023
LM-Polygraph: Uncertainty Estimation for Language Models
EMNLP 2023
AART: AI-Assisted Red-Teaming with Diverse Data Generation for New LLM-powered Applications
EMNLP 2023
Self-Criticism: Aligning Large Language Models with their Understanding of Helpfulness, Honesty, and Harmlessness
EMNLP 2023
Toxicity in ChatGPT: Analyzing Persona-assigned Language Models
EMNLP 2023
Mitigating Backdoor Poisoning Attacks through the Lens of Spurious Correlation
EMNLP 2023
Toward Stronger Textual Attack Detectors
EMNLP 2023
Watermarking LLMs with Weight Quantization
EMNLP 2023
Towards Safe Reinforcement Learning via Constraining Conditional Value-at-Risk
IJCAI 2022
Textual Manifold-based Defense Against Natural Language Adversarial Examples
EMNLP 2022
Are You Stealing My Model? Sample Correlation for Fingerprinting Deep Neural Networks
NeurIPS 2022
Training with More Confidence: Mitigating Injected and Natural Backdoors During Training
NeurIPS 2022
Randomized Message-Interception Smoothing: Gray-box Certificates for Graph Neural Networks
NeurIPS 2022
Provably Adversarially Robust Detection of Out-of-Distribution Data (Almost) for Free
NeurIPS 2022
MORA: Improving Ensemble Robustness Evaluation with Model Reweighing Attack
NeurIPS 2022