Research Explorer
AI Safety
2972 directly classified papers
Papers per year
2002: 1
2006: 1
2007: 1
2012: 4
2013: 1
2015: 5
2016: 1
2017: 13
2018: 40
2019: 91
2020: 111
2021: 181
2022: 204
2023: 333
2024: 642
2025: 1031
2026: 312
Papers
Context-aware Watermark with Semantic Balanced Green-red Lists for Large Language Models
EMNLP 2024
“They are uncultured”: Unveiling Covert Harms and Social Threats in LLM Generated Conversations
EMNLP 2024
RCL: Reliable Continual Learning for Unified Failure Detection
CVPR 2024
Towards General Robustness Verification of MaxPool-based Convolutional Neural Networks via Tightening Linear Approximation
CVPR 2024
Focus on Hiders: Exploring Hidden Threats for Enhancing Adversarial Training
CVPR 2024
Controlling Counterfactual Harm in Decision Support Systems Based on Prediction Sets
NIPS 2024
Relational Verification Leaps Forward with RABBit
NIPS 2024
Nearest is Not Dearest: Towards Practical Defense against Quantization-conditioned Backdoor Attacks
CVPR 2024
Easy-to-Hard Generalization: Scalable Alignment Beyond Human Supervision
NIPS 2024
Refusal in Language Models Is Mediated by a Single Direction
NIPS 2024
Membership Inference Attacks against Fine-tuned Large Language Models via Self-prompt Calibration
NIPS 2024
Detecting Bugs with Substantial Monetary Consequences by LLM and Rule-based Reasoning
NIPS 2024
Physical Backdoor: Towards Temperature-based Backdoor Attacks in the Physical World
CVPR 2024
Exploiting Class Probabilities for Black-box Sentence-level Attacks
EACL 2024
Inconsistent dialogue responses and how to recover from them
EACL 2024
AgentPoison: Red-teaming LLM Agents via Poisoning Memory or Knowledge Bases
NIPS 2024
Many-shot Jailbreaking
NIPS 2024
SafeWorld: Geo-Diverse Safety Alignment
NIPS 2024
Query-Based Adversarial Prompt Generation
NIPS 2024
Gradient Cuff: Detecting Jailbreak Attacks on Large Language Models by Exploring Refusal Loss Landscapes
NIPS 2024
Who's asking? User personas and the mechanics of latent misalignment
NIPS 2024
MedBN: Robust Test-Time Adaptation against Malicious Test Samples
CVPR 2024
ZeroMark: Towards Dataset Ownership Verification without Disclosing Watermark
NIPS 2024
Keeping LLMs Aligned After Fine-tuning: The Crucial Role of Prompt Templates
NIPS 2024
Efficient Privacy-Preserving Visual Localization Using 3D Ray Clouds
CVPR 2024
<
1
…
78
79
80
…
119
>