Research Explorer
AI Safety
2972 directly classified papers
Papers per year
2002: 1
2006: 1
2007: 1
2012: 4
2013: 1
2015: 5
2016: 1
2017: 13
2018: 40
2019: 91
2020: 111
2021: 181
2022: 204
2023: 333
2024: 642
2025: 1031
2026: 312
Papers
GradSafe: Detecting Jailbreak Prompts for LLMs via Safety-Critical Gradient Analysis
ACL 2024
FOFO: A Benchmark to Evaluate LLMs’ Format-Following Capability
ACL 2024
Self-Alignment for Factuality: Mitigating Hallucinations in LLMs via Self-Evaluation
ACL 2024
ValueBench: Towards Comprehensively Evaluating Value Orientations and Understanding of Large Language Models
ACL 2024
ToolSword: Unveiling Safety Issues of Large Language Models in Tool Learning Across Three Stages
ACL 2024
Detoxifying Large Language Models via Knowledge Editing
ACL 2024
Navigating the OverKill in Large Language Models
ACL 2024
SafeDecoding: Defending against Jailbreak Attacks via Safety-Aware Decoding
ACL 2024
Stealthy Attack on Large Language Model based Recommendation
ACL 2024
KIEval: A Knowledge-grounded Interactive Evaluation Framework for Large Language Models
ACL 2024
Text Embedding Inversion Security for Multilingual Language Models
ACL 2024
Safety Alignment in NLP Tasks: Weakly Aligned Summarization as an In-Context Attack
ACL 2024
Defending Large Language Models Against Jailbreaking Attacks Through Goal Prioritization
ACL 2024
WatME: Towards Lossless Watermarking Through Lexical Redundancy
ACL 2024
BadAgent: Inserting and Activating Backdoor Attacks in LLM Agents
ACL 2024
Don’t be my Doctor! Recognizing Healthcare Advice in Large Language Models
EMNLP 2024
Can Machine Unlearning Reduce Social Bias in Language Models?
EMNLP 2024
ULMR: Unlearning Large Language Models via Negative Response and Model Parameter Average
EMNLP 2024
Unveiling and Mitigating Backdoor Vulnerabilities based on Unlearning Weight Changes and Backdoor Activeness
NIPS 2024
WaveAttack: Asymmetric Frequency Obfuscation-based Backdoor Attacks Against Deep Neural Networks
NIPS 2024
Trap-MID: Trapdoor-based Defense against Model Inversion Attacks
NIPS 2024
NN4SysBench: Characterizing Neural Network Verification for Computer Systems
NIPS 2024
Weak-to-Strong Search: Align Large Language Models via Searching over Small Language Models
NIPS 2024
HonestLLM: Toward an Honest and Helpful Large Language Model
NIPS 2024
MLLMGuard: A Multi-dimensional Safety Evaluation Suite for Multimodal Large Language Models
NIPS 2024