Research Explorer
AI Safety
2972 directly classified papers
Papers per year
2002: 1
2006: 1
2007: 1
2012: 4
2013: 1
2015: 5
2016: 1
2017: 13
2018: 40
2019: 91
2020: 111
2021: 181
2022: 204
2023: 333
2024: 642
2025: 1031
2026: 312
Papers
I Don't Know: Explicit Modeling of Uncertainty with an [IDK] Token
NIPS 2024
Representation Noising: A Defence Mechanism Against Harmful Finetuning
NIPS 2024
Bag of Tricks: Benchmarking of Jailbreak Attacks on LLMs
NIPS 2024
Fast Best-of-N Decoding via Speculative Rejection
NIPS 2024
Mission Impossible: A Statistical Perspective on Jailbreaking LLMs
NIPS 2024
Robust Prompt Optimization for Defending Language Models Against Jailbreaking Attacks
NIPS 2024
Unveiling the Bias Impact on Symmetric Moral Consistency of Large Language Models
NIPS 2024
Personalized Steering of Large Language Models: Versatile Steering Vectors Through Bi-directional Preference Optimization
NIPS 2024
Selective Generation for Controllable Language Models
NIPS 2024
CTIBench: A Benchmark for Evaluating LLMs in Cyber Threat Intelligence
NIPS 2024
Calibrated Self-Rewarding Vision Language Models
NIPS 2024
Stealth edits to large language models
NIPS 2024
ERBench: An Entity-Relationship based Automatically Verifiable Hallucination Benchmark for Large Language Models
NIPS 2024
JailbreakBench: An Open Robustness Benchmark for Jailbreaking Large Language Models
NIPS 2024
Bileve: Securing Text Provenance in Large Language Models Against Spoofing with Bi-level Signature
NIPS 2024
NYU CTF Bench: A Scalable Open-Source Benchmark Dataset for Evaluating LLMs in Offensive Security
NIPS 2024
Impeding LLM-assisted Cheating in Introductory Programming Assignments via Adversarial Perturbation
EMNLP 2024
On the Influence of Gender and Race in Romantic Relationship Prediction from Large Language Models
EMNLP 2024
Evaluating the Instruction-Following Robustness of Large Language Models to Prompt Injection
EMNLP 2024
Successfully Guiding Humans with Imperfect Instructions by Highlighting Potential Errors and Suggesting Corrections
EMNLP 2024
AgentReview: Exploring Peer Review Dynamics with LLM Agents
EMNLP 2024
Towards Tool Use Alignment of Large Language Models
EMNLP 2024
Evaluating Psychological Safety of Large Language Models
EMNLP 2024
Alignment-Enhanced Decoding: Defending Jailbreaks via Token-Level Adaptive Refining of Probability Distributions
EMNLP 2024
Learn to Refuse: Making Large Language Models More Controllable and Reliable through Knowledge Scope Limitation and Refusal Mechanism
EMNLP 2024