Research Explorer
AI Safety
2972 directly classified papers
Papers per year
2002: 1
2006: 1
2007: 1
2012: 4
2013: 1
2015: 5
2016: 1
2017: 13
2018: 40
2019: 91
2020: 111
2021: 181
2022: 204
2023: 333
2024: 642
2025: 1031
2026: 312
Papers
Beyond Guardrails: Advanced Safety for Large Language Models — Monolingual, Multilingual and Multimodal Frontiers
IJCNLP 2025
Algorithmic Fidelity of Large Language Models in Generating Synthetic German Public Opinions: A Case Study
ACL 2025
Online Learning Defense against Iterative Jailbreak Attacks via Prompt Optimization
IJCNLP 2025
Role-Aware Language Models for Secure and Contextualized Access Control in Organizations
IJCNLP 2025
On the Convergence of Moral Self-Correction in Large Language Models
IJCNLP 2025
Hidden in Plain Text: Emergence & Mitigation of Steganographic Collusion in LLMs
IJCNLP 2025
Towards a Theory of AI Personhood
AAAI 2025
A Survey on LLM-Assisted Clinical Trial Recruitment
IJCNLP 2025
Human-in-the-Loop Generation of Adversarial Texts: A Case Study on Tibetan Script
IJCNLP 2025
Information-theoretic Distinctions Between Deception and Confusion
IJCNLP 2025
Counterfactual Evaluation for Blind Attack Detection in LLM-based Evaluation Systems
IJCNLP 2025
Attention Eclipse: Manipulating Attention to Bypass LLM Safety-Alignment
EMNLP 2025
When Truthful Representations Flip Under Deceptive Instructions?
EMNLP 2025
Swushroomsia at SemEval-2025 Task 3: Probing LLMs’ Collective Intelligence for Multilingual Hallucination Detection
SEMEVAL 2025
UCSC at SemEval-2025 Task 3: Context, Models and Prompt Optimization for Automated Hallucination Detection in LLM Output
SEMEVAL 2025
SABER: Uncovering Vulnerabilities in Safety Alignment via Cross-Layer Residual Connection
EMNLP 2025
MULTIGUARD: An Efficient Approach for AI Safety Moderation Across Languages and Modalities
EMNLP 2025
HausaNLP at SemEval-2025 Task 3: Towards a Fine-Grained Model-Aware Hallucination Detection
SEMEVAL 2025
Governance in Motion: Co-evolution of Constitutions and AI models for Scalable Safety
EMNLP 2025
Phi: Preference Hijacking in Multi-modal Large Language Models at Inference Time
EMNLP 2025
Improbable Bigrams Expose Vulnerabilities of Incomplete Tokens in Byte-Level Tokenizers
EMNLP 2025
TrojanWave: Exploiting Prompt Learning for Stealthy Backdoor Attacks on Large Audio-Language Models
EMNLP 2025
Mitigating Hallucinations in LM-Based TTS Models via Distribution Alignment Using GFlowNets
EMNLP 2025
Path Drift in Large Reasoning Models: How First-Person Commitments Override Safety
EMNLP 2025
Pierce the Mists, Greet the Sky: Decipher Knowledge Overshadowing via Knowledge Circuit Analysis
EMNLP 2025