Research Explorer
AI Safety
2972 directly classified papers
Papers per year
2002: 1
2006: 1
2007: 1
2012: 4
2013: 1
2015: 5
2016: 1
2017: 13
2018: 40
2019: 91
2020: 111
2021: 181
2022: 204
2023: 333
2024: 642
2025: 1031
2026: 312
Papers
Improving Consistency in LLM Inference using Probabilistic Tokenization
NAACL 2025
WordGame: Efficient & Effective LLM Jailbreak via Simultaneous Obfuscation in Query and Response
NAACL 2025
A Practical Examination of AI-Generated Text Detectors for Large Language Models
NAACL 2025
Breaking ReAct Agents: Foot-in-the-Door Attack Will Get You In
NAACL 2025
Aligning to What? Limits to RLHF Based Alignment
NAACL 2025
Tongue-Tied: Breaking LLMs Safety Through New Language Learning
NAACL 2025
SSNTrio@DravidianLangTech 2025: Identification of AI Generated Content in Dravidian Languages using Transformers
NAACL 2025
CIC-NLP@DravidianLangTech 2025: Detecting AI-generated Product Reviews in Dravidian Languages
NAACL 2025
A Comparative Analysis of Ethical and Safety Gaps in LLMs using Relative Danger Coefficient
NAACL 2025
A Comprehensive Evaluation of Cognitive Biases in LLMs
NAACL 2025
Smaller Large Language Models Can Do Moral Self-Correction
NAACL 2025
Break the Breakout: Reinventing LM Defense Against Jailbreak Attacks with Self-Refine
NAACL 2025
Line of Duty: Evaluating LLM Self-Knowledge via Consistency in Feasibility Boundaries
NAACL 2025
Multi-lingual Multi-turn Automated Red Teaming for LLMs
NAACL 2025
Summary the Savior: Harmful Keyword and Query-based Summarization for LLM Jailbreak Defense
NAACL 2025
Building Safe GenAI Applications: An End-to-End Overview of Red Teaming for Large Language Models
NAACL 2025
Automating Steering for Safe Multimodal Large Language Models
EMNLP 2025
PBI-Attack: Prior-Guided Bimodal Interactive Black-Box Jailbreak Attack for Toxicity Maximization
EMNLP 2025
DISTIL: Data-Free Inversion of Suspicious Trojan Inputs via Latent Diffusion
ICCV 2025
Backdoor Mitigation by Distance-Driven Detoxification
ICCV 2025
Jailbreaking Multimodal Large Language Models via Shuffle Inconsistency
ICCV 2025
Towards Adversarial Robustness via Debiased High-Confidence Logit Alignment
ICCV 2025
Safeguarding Vision-Language Models: Mitigating Vulnerabilities to Gaussian Noise in Perturbation-based Attacks
ICCV 2025
Gradient-Reweighted Adversarial Camouflage for Physical Object Detection Evasion
ICCV 2025
Adversarial Robust Memory-Based Continual Learner
ICCV 2025