Research Explorer
AI Safety
2972 directly classified papers
Papers per year
2002: 1
2006: 1
2007: 1
2012: 4
2013: 1
2015: 5
2016: 1
2017: 13
2018: 40
2019: 91
2020: 111
2021: 181
2022: 204
2023: 333
2024: 642
2025: 1031
2026: 312
Papers
Reconsidering Deception in Social Robotics: The Role of Human Vulnerability (Student Abstract)
AAAI 2023
Robust Multi-bit Natural Language Watermarking through Invariant Features
ACL 2023
Explanation-based Finetuning Makes Models More Robust to Spurious Cues
ACL 2023
Are You Copying My Model? Protecting the Copyright of Large Language Models for EaaS via Backdoor Watermark
ACL 2023
Targeted Data Generation: Finding and Fixing Model Weaknesses
ACL 2023
BITE: Textual Backdoor Attacks with Iterative Trigger Injection
ACL 2023
Nichelle and Nancy: The Influence of Demographic Attributes and Tokenization Length on First Name Biases
ACL 2023
Controlling the Extraction of Memorized Data from Large Language Models via Prompt-Tuning
ACL 2023
A Gradient Control Method for Backdoor Attacks on Parameter-Efficient Tuning
ACL 2023
Don’t Retrain, Just Rewrite: Countering Adversarial Perturbations by Rewriting Text
ACL 2023
Maximum Entropy Loss, the Silver Bullet Targeting Backdoor Attacks in Pre-trained Language Models
ACL 2023
Foveate, Attribute, and Rationalize: Towards Physically Safe and Trustworthy AI
ACL 2023
Reward Gaming in Conditional Text Generation
ACL 2023
ClarifyDelphi: Reinforced Clarification Questions with Defeasibility Rewards for Social and Moral Situations
ACL 2023
Certified Robustness via Dynamic Margin Maximization and Improved Lipschitz Regularization
NeurIPS 2023
Reliability Check: An Analysis of GPT-3’s Response to Sensitive Topics and Prompt Wording
ACL 2023
Adversarial Textual Robustness on Visual Dialog
ACL 2023
Logic-driven Indirect Supervision: An Application to Crisis Counseling
ACL 2023
FORK: A Bite-Sized Test Set for Probing Culinary Cultural Biases in Commonsense Reasoning Models
ACL 2023
Diffusion Theory as a Scalpel: Detecting and Purifying Poisonous Dimensions in Pre-trained Language Models Caused by Backdoor or Bias
ACL 2023
Can Large Language Models Safely Address Patient Questions Following Cataract Surgery?
ACL 2023
Unveiling the Implicit Toxicity in Large Language Models
EMNLP 2023
The Past, Present and Better Future of Feedback Learning in Large Language Models for Subjective Human Preferences and Values
EMNLP 2023
The Troubling Emergence of Hallucination in Large Language Models - An Extensive Definition, Quantification, and Prescriptive Remediations
EMNLP 2023
TrojanSQL: SQL Injection against Natural Language Interface to Database
EMNLP 2023