Research Explorer
AI Safety
2972 directly classified papers
Papers per year
2002: 1
2006: 1
2007: 1
2012: 4
2013: 1
2015: 5
2016: 1
2017: 13
2018: 40
2019: 91
2020: 111
2021: 181
2022: 204
2023: 333
2024: 642
2025: 1031
2026: 312
Papers
MLLM-Protector: Ensuring MLLM’s Safety without Hurting Performance
EMNLP 2024
OmniSafe: An Infrastructure for Accelerating Safe Reinforcement Learning Research
JMLR 2024
Intent-Aware and Hate-Mitigating Counterspeech Generation via Dual-Discriminator Guided LLMs
COLING 2024
Safety filters for black-box dynamical systems by learning discriminating hyperplanes
L4DC 2024
Generalized constraint for probabilistic safe reinforcement learning
L4DC 2024
Do no harm: A counterfactual approach to safe reinforcement learning
L4DC 2024
Hacking predictors means hacking cars: Using sensitivity analysis to identify trajectory prediction vulnerabilities for autonomous driving security
L4DC 2024
From raw data to safety: Reducing conservatism by set expansion
L4DC 2024
You don’t need a personality test to know these models are unreliable: Assessing the Reliability of Large Language Models on Psychometric Instruments
NAACL 2024
Enhancing Large Language Models Against Inductive Instructions with Dual-critique Prompting
NAACL 2024
XSTest: A Test Suite for Identifying Exaggerated Safety Behaviours in Large Language Models
NAACL 2024
Attacks, Defenses and Evaluations for LLM Conversation Safety: A Survey
NAACL 2024
Towards Probing Speech-Specific Risks in Large Multimodal Models: A Taxonomy, Benchmark, and Insights
EMNLP 2024
A System to Detect Forged-Origin BGP Hijacks
NSDI 2024
Knowing What LLMs DO NOT Know: A Simple Yet Effective Self-Detection Method
NAACL 2024
Pixel-wise Smoothing for Certified Robustness against Camera Motion Perturbations
AISTATS 2024
R-Tuning: Instructing Large Language Models to Say ‘I Don’t Know’
NAACL 2024
BaitAttack: Alleviating Intention Shift in Jailbreak Attacks via Adaptive Bait Crafting
EMNLP 2024
Advancing the Robustness of Large Language Models through Self-Denoised Smoothing
NAACL 2024
Removing RLHF Protections in GPT-4 via Fine-Tuning
NAACL 2024
Language Models are Alignable Decision-Makers: Dataset and Application to the Medical Triage Domain
NAACL 2024
Citation: A Key to Building Responsible and Accountable Large Language Models
NAACL 2024
ReEval: Automatic Hallucination Evaluation for Retrieval-Augmented Large Language Models via Transferable Adversarial Attacks
NAACL 2024
Towards Healthy AI: Large Language Models Need Therapists Too
NAACL 2024
Cross-Task Defense: Instruction-Tuning LLMs for Content Safety
NAACL 2024