Research Explorer
AI Safety
2972 directly classified papers
Papers per year
2002: 1
2006: 1
2007: 1
2012: 4
2013: 1
2015: 5
2016: 1
2017: 13
2018: 40
2019: 91
2020: 111
2021: 181
2022: 204
2023: 333
2024: 642
2025: 1031
2026: 312
Papers
Covert Bias: The Severity of Social Views’ Unalignment in Language Models Towards Implicit and Explicit Opinion
EMNLP 2024
Active Learning for Robust and Representative LLM Generation in Safety-Critical Scenarios
EMNLP 2024
Pruning for Protection: Increasing Jailbreak Resistance in Aligned LLMs Without Fine-Tuning
EMNLP 2024
WellDunn: On the Robustness and Explainability of Language Models and Large Language Models in Identifying Wellness Dimensions
EMNLP 2024
LLM Internal States Reveal Hallucination Risk Faced With a Query
EMNLP 2024
Think Twice Before Trusting: Self-Detection for Large Language Models through Comprehensive Answer Reflection
EMNLP 2024
Virtual Context Enhancing Jailbreak Attacks with Special Token Injection
EMNLP 2024
NegotiationToM: A Benchmark for Stress-testing Machine Theory of Mind on Negotiation Surrounding
EMNLP 2024
Self-Evolution Fine-Tuning for Policy Optimization
EMNLP 2024
LLMGuard: Guarding against Unsafe LLM Behavior
AAAI 2024
Evaluating AI Red Teaming’s Readiness to Address Environmental Harms: A Thematic Analysis of LLM Discourse
AAAI 2024
Autonomous Policy Explanations for Effective Human-Machine Teaming
AAAI 2024
The Generalization and Robustness of Transformer-Based Language Models on Commonsense Reasoning
AAAI 2024
Dr. R.O. Bott Will See You Now: Exploring AI for Wellbeing with Middle School Students
AAAI 2024
Detecting AI-Generated Code Assignments Using Perplexity of Large Language Models
AAAI 2024
Visual Adversarial Examples Jailbreak Aligned Large Language Models
AAAI 2024
SocialStigmaQA: A Benchmark to Uncover Stigma Amplification in Generative Language Models
AAAI 2024
MathAttack: Attacking Large Language Models towards Math Solving Ability
AAAI 2024
Preference Ranking Optimization for Human Alignment
AAAI 2024
Watch Out for Your Agents! Investigating Backdoor Threats to LLM-Based Agents
NIPS 2024
Normative Testimony and Belief Functions: A Formal Theory of Norm Learning
IJCAI 2024
ART: Automatic Red-teaming for Text-to-Image Models to Protect Benign Users
NIPS 2024
Hyper-opinion Evidential Deep Learning for Out-of-Distribution Detection
NIPS 2024
LLM Evaluators Recognize and Favor Their Own Generations
NIPS 2024
DALD: Improving Logits-based Detector without Logits from Black-box LLMs
NIPS 2024