Research Explorer
AI Safety
2972 directly classified papers
Papers per year
2002: 1
2006: 1
2007: 1
2012: 4
2013: 1
2015: 5
2016: 1
2017: 13
2018: 40
2019: 91
2020: 111
2021: 181
2022: 204
2023: 333
2024: 642
2025: 1031
2026: 312
Papers
Sandwich attack: Multi-language Mixture Adaptive Attack on LLMs
NAACL 2024
Robust Safety Classifier Against Jailbreaking Attacks: Adversarial Prompt Shield
NAACL 2024
Subspace Defense: Discarding Adversarial Perturbations by Learning a Subspace for Clean Signals
COLING 2024
Simpler Becomes Harder: Do LLMs Exhibit a Coherent Behavior on Simplified Corpora?
COLING 2024
Value FULCRA: Mapping Large Language Models to the Multidimensional Spectrum of Basic Human Value
NAACL 2024
How does GPT-2 Predict Acronyms? Extracting and Understanding a Circuit via Mechanistic Interpretability
AISTATS 2024
Composite Backdoor Attacks Against Large Language Models
NAACL 2024
Think While You Write: Hypothesis Verification Promotes Faithful Knowledge-to-Text Generation
NAACL 2024
Non-vacuous Generalization Bounds for Adversarial Risk in Stochastic Neural Networks
AISTATS 2024
Extracting Prompts by Inverting LLM Outputs
EMNLP 2024
Small Agent Can Also Rock! Empowering Small Language Models as Hallucination Detector
EMNLP 2024
Holistic Automated Red Teaming for Large Language Models through Top-Down Test Case Generation and Multi-turn Interaction
EMNLP 2024
Large Language Models Are Involuntary Truth-Tellers: Exploiting Fallacy Failure for Jailbreak Attacks
EMNLP 2024
BEEAR: Embedding-based Adversarial Removal of Safety Backdoors in Instruction-tuned Language Models
EMNLP 2024
CleanGen: Mitigating Backdoor Attacks for Generation Tasks in Large Language Models
EMNLP 2024
MICo: Preventative Detoxification of Large Language Models through Inhibition Control
NAACL 2024
ADEA: An Argumentative Dialogue Dataset on Ethical Issues Concerning Future A.I. Applications
COLING 2024
Rethinking Machine Ethics – Can LLMs Perform Moral Reasoning through the Lens of Moral Theories?
NAACL 2024
Rainbow Teaming: Open-Ended Generation of Diverse Adversarial Prompts
NeurIPS 2024
Vaccine: Perturbation-aware Alignment for Large Language Models against Harmful Fine-tuning Attack
NeurIPS 2024
On scalable oversight with weak LLMs judging strong LLMs
NeurIPS 2024
Task-Agnostic Detector for Insertion-Based Backdoor Attacks
NAACL 2024
k-SemStamp: A Clustering-Based Semantic Watermark for Detection of Machine-Generated Text
ACL 2024
Direct Large Language Model Alignment Through Self-Rewarding Contrastive Prompt Distillation
ACL 2024
Acquiring Clean Language Models from Backdoor Poisoned Datasets by Downscaling Frequency Space
ACL 2024