Research Explorer
AI Safety
2972 directly classified papers
Papers per year
2002: 1
2006: 1
2007: 1
2012: 4
2013: 1
2015: 5
2016: 1
2017: 13
2018: 40
2019: 91
2020: 111
2021: 181
2022: 204
2023: 333
2024: 642
2025: 1031
2026: 312
Papers
Counterfactual Evaluation for Blind Attack Detection in LLM-based Evaluation Systems
IJCNLP 2025
Hidden in Plain Text: Emergence & Mitigation of Steganographic Collusion in LLMs
IJCNLP 2025
On the Convergence of Moral Self-Correction in Large Language Models
IJCNLP 2025
Online Learning Defense against Iterative Jailbreak Attacks via Prompt Optimization
IJCNLP 2025
Transferable Direct Prompt Injection via Activation-Guided MCMC Sampling
EMNLP 2025
Foot-In-The-Door: A Multi-turn Jailbreak for LLMs
EMNLP 2025
Aurora-M: Open Source Continual Pre-training for Multilingual Language and Code
COLING 2025
Lightweight Safety Guardrails Using Fine-tuned BERT Embeddings
COLING 2025
Towards Truly Open, Language-Specific, Safe, Factual, and Specialized Large Language Models
COLING 2025
Chat Bankman-Fried: an Exploration of LLM Alignment in Finance
COLING 2025
SilverSpeak: Evading AI-Generated Text Detectors using Homoglyphs
COLING 2025
Human vs. AI: A Novel Benchmark and a Comparative Study on the Detection of Generated Images and the Impact of Prompts
COLING 2025
Mirror Minds : An Empirical Study on Detecting LLM-Generated Text via LLMs
COLING 2025
CoBia: Constructed Conversations Can Trigger Otherwise Concealed Societal Biases in LLMs
EMNLP 2025
Persuasion Dynamics in LLMs: Investigating Robustness and Adaptability in Knowledge and Safety with DuET-PD
EMNLP 2025
IPIGuard: A Novel Tool Dependency Graph-Based Defense Against Indirect Prompt Injection in LLM Agents
EMNLP 2025
Refusal-Aware Red Teaming: Exposing Inconsistency in Safety Evaluations
EMNLP 2025
Context Reasoner: Incentivizing Reasoning Capability for Contextualized Privacy and Safety Compliance via Reinforcement Learning
EMNLP 2025
EMNLP: Educator-role Moral and Normative Large Language Models Profiling
EMNLP 2025
Atoxia: Red-teaming Large Language Models with Target Toxic Answers
NAACL 2025
Challenges in Trustworthy Human Evaluation of Chatbots
NAACL 2025
Multilingual Blending: Large Language Model Safety Alignment Evaluation with Language Mixture
NAACL 2025
Safe Inputs but Unsafe Output: Benchmarking Cross-modality Safety Alignment of Large Vision-Language Models
NAACL 2025
Large Language Models are Easily Confused: A Quantitative Metric, Security Implications and Typological Analysis
NAACL 2025
Vulnerability of Large Language Models to Output Prefix Jailbreaks: Impact of Positions on Safety
NAACL 2025