Research Explorer
AI Safety
2972 directly classified papers
Papers per year
2002: 1
2006: 1
2007: 1
2012: 4
2013: 1
2015: 5
2016: 1
2017: 13
2018: 40
2019: 91
2020: 111
2021: 181
2022: 204
2023: 333
2024: 642
2025: 1031
2026: 312
Papers
Detecting Child Objectification on Social Media: Challenges in Language Modeling
ACL 2025
Do not Abstain! Identify and Solve the Uncertainty
ACL 2025
Graph of Attacks with Pruning: Optimizing Stealthy Jailbreak Prompt Generation for Enhanced LLM Content Moderation
ACL 2025
Unraveling Misinformation Propagation in LLM Reasoning
EMNLP 2025
Red-Teaming for Uncovering Societal Bias in Large Language Models
ACL 2025
SPA-VL: A Comprehensive Safety Preference Alignment Dataset for Vision Language Models
CVPR 2025
Can You Really Trust Code Copilot? Evaluating Large Language Models from a Code Security Perspective
ACL 2025
Citation Drift: Measuring Reference Stability in Multi-Turn LLM Conversations
IJCNLP 2025
AGD: Adversarial Game Defense Against Jailbreak Attacks in Large Language Models
ACL 2025
What’s the most important value? INVP: INvestigating the Value Priorities of LLMs through Decision-making in Social Scenarios
COLING 2025
MisinfoBench: A Multi-Dimensional Benchmark for Evaluating LLMs’ Resilience to Misinformation
EMNLP 2025
Weakest Link in the Chain: Security Vulnerabilities in Advanced Reasoning Models
ACL 2025
Faithful and Robust LLM-Driven Theorem Proving for NLI Explanations
ACL 2025
CONTRANS: Weak-to-Strong Alignment Engineering via Concept Transplantation
COLING 2025
BadToken: Token-level Backdoor Attacks to Multi-modal Large Language Models
CVPR 2025
Immune: Improving Safety Against Jailbreaks in Multi-modal LLMs via Inference-Time Alignment
CVPR 2025
The Dark Side of Function Calling: Pathways to Jailbreaking Large Language Models
COLING 2025
Exploring Backdoor Vulnerabilities of Chat Models
COLING 2025
Con-ReCall: Detecting Pre-training Data in LLMs via Contrastive Decoding
COLING 2025
Monte Carlo Tree Search Based Prompt Autogeneration for Jailbreak Attacks against LLMs
COLING 2025
Cognitive Biases, Task Complexity, and Result Interpretability in Large Language Models
COLING 2025
“Not Aligned” is Not “Malicious”: Being Careful about Hallucinations of Large Language Models’ Jailbreak
COLING 2025
Intention Analysis Makes LLMs A Good Jailbreak Defender
COLING 2025
Revisiting Jailbreaking for Large Language Models: A Representation Engineering Perspective
COLING 2025
Robust Preference Optimization via Dynamic Target Margins
ACL 2025