Research Explorer
AI Safety
2972 directly classified papers
Papers per year
2002: 1
2006: 1
2007: 1
2012: 4
2013: 1
2015: 5
2016: 1
2017: 13
2018: 40
2019: 91
2020: 111
2021: 181
2022: 204
2023: 333
2024: 642
2025: 1031
2026: 312
Papers
Do LLM hallucination detectors suffer from low-resource effect?
EACL 2026
ARREST: Adversarial Resilient Regulation Enhancing Safety and Truth in Large Language Models
EACL 2026
Utterance-level Detection Framework for LLM-Involved Content Detection in Conversational Setting
EACL 2026
When the Model Said ‘No Comment’, We Knew Helpfulness Was Dead, Honesty Was Alive, and Safety Was Terrified
EACL 2026
FaithLM: Towards Faithful Explanations for Large Language Models
EACL 2026
Attribution-Guided Multi-Object Hallucination and Bias Detection in Vision-Language Models
EACL 2026
Better Safe Than Sorry? Overreaction Problem of Vision Language Models in Visual Emergency Recognition
WACV 2026
NP-Hard Lower Bound Complexity for Semantic Self-Verification
EACL 2026
Adaptive Helpfulness–Harmlessness Alignment with Preference Vectors
EACL 2026
Are My Optimized Prompts Compromised? Exploring Vulnerabilities of LLM-based Optimizers
EACL 2026
Learning Multilingual Agentic Policy to Control Sycophancy
EACL 2026
ToxiPrompt: A Two-Stage Red-Teaming Approach for Balancing Adversarial Prompt Diversity and Response Toxicity
EACL 2026
When Can We Trust LLMs in Mental Health? Large-Scale Benchmarks for Reliable LLM Evaluation
EACL 2026
Zer0-Jack: A memory-efficient gradient-based jailbreaking method for black box Multi-modal Large Language Models
EACL 2026
BAFLE-DCT: Bypassing Adversarial Filters via Frequency-Selective Embedding in the DCT Domain
WACV 2026
UltraClean: A Simple Framework to Train Robust Neural Networks against Backdoor Attacks
WACV 2026
Teams of LLM Agents can Exploit Zero-Day Vulnerabilities
EACL 2026
Understanding Jailbreak Success: A Study of Latent Space Dynamics in Large Language Models
EACL 2026
Attacker’s Noise Can Manipulate Your Audio-based LLM in the Real World
EACL 2026
CRADLE Bench: A Clinician-Annotated Benchmark for Multi-Faceted Mental Health Crisis and Safety Risk Detection
EACL 2026
Unraveling LLM Jailbreaks Through Safety Knowledge Neurons
EACL 2026
Layer-wise Swapping for Generalizable Multilingual Safety
EACL 2026
Safe-Unsafe Concept Separation Emerges from a Single Direction in Language Models Activation Space
EACL 2026
Breach in the Shield: Unveiling the Vulnerabilities of Large Language Models
EACL 2026
From Delegates to Trustees: How Optimizing for Long-Term Interests Shapes Bias and Alignment in LLMs
EACL 2026