Research Explorer
AI Safety
2972 directly classified papers
Papers per year
2002: 1
2006: 1
2007: 1
2012: 4
2013: 1
2015: 5
2016: 1
2017: 13
2018: 40
2019: 91
2020: 111
2021: 181
2022: 204
2023: 333
2024: 642
2025: 1031
2026: 312
Papers
Scaling the Convex Barrier with Sparse Dual Algorithms
JMLR 2024
TrojFSP: Trojan Insertion in Few-shot Prompt Tuning
NAACL 2024
Ensuring Safe and High-Quality Outputs: A Guideline Library Approach for Language Models
NAACL 2024
Look Who’s Talking Now: Covert Channels From Biased LLMs
EMNLP 2024
IterAlign: Iterative Constitutional Alignment of Large Language Models
NAACL 2024
FLIRT: Feedback Loop In-context Red Teaming
EMNLP 2024
SELF-GUARD: Empower the LLM to Safeguard Itself
NAACL 2024
MART: Improving LLM Safety with Multi-round Automatic Red-Teaming
NAACL 2024
A Wolf in Sheep’s Clothing: Generalized Nested Jailbreak Prompts can Fool Large Language Models Easily
NAACL 2024
How Trustworthy are Open-Source LLMs? An Assessment under Malicious Demonstrations Shows their Vulnerabilities
NAACL 2024
Mitigating Reward Overoptimization via Lightweight Uncertainty Estimation
NeurIPS 2024
AgentDojo: A Dynamic Environment to Evaluate Prompt Injection Attacks and Defenses for LLM Agents
NeurIPS 2024
Know Thine Enemy: Adaptive Attacks on Misinformation Detection Using Reinforcement Learning
ACL 2024
Reinforcement Learning-Driven LLM Agent for Automated Attacks on LLMs
ACL 2024
From Representational Harms to Quality-of-Service Harms: A Case Study on Llama 2 Safety Safeguards
ACL 2024
CodeAttack: Revealing Safety Generalization Challenges of Large Language Models via Code Completion
ACL 2024
Do Zombies Understand? A Choose-Your-Own-Adventure Exploration of Machine Cognition
ACL 2024
LIRE: listwise reward enhancement for preference alignment
ACL 2024
The Language Barrier: Dissecting Safety Challenges of LLMs in Multilingual Contexts
ACL 2024
One-Shot Safety Alignment for Large Language Models via Optimal Dualization
NeurIPS 2024
Bias Detection via Signaling
NeurIPS 2024
Instructions as Backdoors: Backdoor Vulnerabilities of Instruction Tuning for Large Language Models
NAACL 2024
CONSCENDI: A Contrastive and Scenario-Guided Distillation Approach to Guardrail Models for Virtual Assistants
NAACL 2024
Towards a Unified Framework for Adaptable Problematic Content Detection via Continual Learning
NAACL 2024
Ethics in Action: Training Reinforcement Learning Agents for Moral Decision-making In Text-based Adventure Games
AISTATS 2024