Research Explorer
AI Safety
2972 directly classified papers
Papers per year
2002: 1
2006: 1
2007: 1
2012: 4
2013: 1
2015: 5
2016: 1
2017: 13
2018: 40
2019: 91
2020: 111
2021: 181
2022: 204
2023: 333
2024: 642
2025: 1031
2026: 312
Papers
GumbelSoft: Diversified Language Model Watermarking via the GumbelMax-trick
ACL 2024
Relying on the Unreliable: The Impact of Language Models’ Reluctance to Express Uncertainty
ACL 2024
More than Minorities and Majorities: Understanding Multilateral Bias in Language Generation
ACL 2024
Defending Against Weight-Poisoning Backdoor Attacks for Parameter-Efficient Fine-Tuning
NAACL 2024
Cognitive Overload: Jailbreaking Large Language Models with Overloaded Logical Thinking
NAACL 2024
Achieving Domain-Independent Certified Robustness via Knowledge Continuity
NIPS 2024
Truthful High Dimensional Sparse Linear Regression
NIPS 2024
Improving Alignment and Robustness with Circuit Breakers
NIPS 2024
Provably Safe Neural Network Controllers via Differential Dynamic Logic
NIPS 2024
Measuring Goal-Directedness
NIPS 2024
When LLM Meets DRL: Advancing Jailbreaking Efficiency via DRL-guided Search
NIPS 2024
A theoretical case-study of Scalable Oversight in Hierarchical Reinforcement Learning
NIPS 2024
Test Where Decisions Matter: Importance-driven Testing for Deep Reinforcement Learning
NIPS 2024
Learning Human-like Representations to Enable Learning Human Values
NIPS 2024
Simplifying Constraint Inference with Inverse Reinforcement Learning
NIPS 2024
Everyday Object Meets Vision-and-Language Navigation Agent via Backdoor
NIPS 2024
Predicting Future Actions of Reinforcement Learning Agents
NIPS 2024
RA-PbRL: Provably Efficient Risk-Aware Preference-Based Reinforcement Learning
NIPS 2024
Monitoring of Perception Systems: Deterministic, Probabilistic, and Learning-Based Fault Detection and Identification (Abstract Reprint)
AAAI 2024
Secure Your Model: An Effective Key Prompt Protection Mechanism for Large Language Models
NAACL 2024
Safe Linear Bandits over Unknown Polytopes
COLT 2024
On the Computability of Robust PAC Learning
COLT 2024
Language Models are Homer Simpson! Safety Re-Alignment of Fine-tuned Language Models through Task Arithmetic
ACL 2024
Can LLMs Correct Physicians, Yet? Investigating Effective Interaction Methods in the Medical Domain
NAACL 2024
The power of an adversary in Glauber dynamics
COLT 2024