Research Explorer
AI Safety
2972 directly classified papers
Papers per year
2002: 1
2006: 1
2007: 1
2012: 4
2013: 1
2015: 5
2016: 1
2017: 13
2018: 40
2019: 91
2020: 111
2021: 181
2022: 204
2023: 333
2024: 642
2025: 1031
2026: 312
Papers
Steering Llama 2 via Contrastive Activation Addition
ACL 2024
Breaking the False Sense of Security in Backdoor Defense through Re-Activation Attack
NeurIPS 2024
InjecAgent: Benchmarking Indirect Prompt Injections in Tool-Integrated Large Language Model Agents
ACL 2024
Here’s a Free Lunch: Sanitizing Backdoored Models with Model Merge
ACL 2024
BadActs: A Universal Backdoor Defense in the Activation Space
ACL 2024
RAID: A Shared Benchmark for Robust Evaluation of Machine-Generated Text Detectors
ACL 2024
Extreme Miscalibration and the Illusion of Adversarial Robustness
ACL 2024
How Susceptible are Large Language Models to Ideological Manipulation?
EMNLP 2024
Don’t Just Say “I don’t know”! Self-aligning Large Language Models for Responding to Unknown Questions with Explanations
EMNLP 2024
Expert-level protocol translation for self-driving labs
NeurIPS 2024
Effective Ambiguity Attack Against Passport-Based DNN Intellectual Property Protection Schemes Through Fully Connected Layer Substitution
CVPR 2023
Inference-Time Intervention: Eliciting Truthful Answers from a Language Model
NeurIPS 2023
RealBehavior: A Framework for Faithfully Characterizing Foundation Models’ Human-like Behavior Mechanisms
EMNLP 2023
The Resource Problem of Using Linear Layer Leakage Attack in Federated Learning
CVPR 2023
Utterance Classification with Logical Neural Network: Explainable AI for Mental Disorder Diagnosis
ACL 2023
NOTABLE: Transferable Backdoor Attacks Against Prompt-based NLP Models
ACL 2023
Query-Efficient Black-Box Red Teaming via Bayesian Optimization
ACL 2023
Improving the robustness of NLI models with minimax training
ACL 2023
SafeConv: Explaining and Correcting Conversational Unsafe Behavior
ACL 2023
Holistic Adversarial Robustness of Deep Learning Models
AAAI 2023
Towards Safe AI: Sandboxing DNNs-Based Controllers in Stochastic Games
AAAI 2023
Reachability Analysis of Neural Network Control Systems
AAAI 2023
DeepGemini: Verifying Dependency Fairness for Deep Neural Network
AAAI 2023
Beyond NaN: Resiliency of Optimization Layers in the Face of Infeasibility
AAAI 2023
Towards Verifying the Geometric Robustness of Large-Scale Neural Networks
AAAI 2023