AI Safety

2972 directly classified papers

Papers

Do LLM hallucination detectors suffer from low-resource effect? EACL 2026

ARREST: Adversarial Resilient Regulation Enhancing Safety and Truth in Large Language Models EACL 2026

Utterance-level Detection Framework for LLM-Involved Content Detection in Conversational Setting EACL 2026

When the Model Said ‘No Comment’, We Knew Helpfulness Was Dead, Honesty Was Alive, and Safety Was Terrified EACL 2026

FaithLM: Towards Faithful Explanations for Large Language Models EACL 2026

Attribution-Guided Multi-Object Hallucination and Bias Detection in Vision-Language Models EACL 2026

Better Safe Than Sorry? Overreaction Problem of Vision Language Models in Visual Emergency Recognition WACV 2026

NP-Hard Lower Bound Complexity for Semantic Self-Verification EACL 2026

Adaptive Helpfulness–Harmlessness Alignment with Preference Vectors EACL 2026

Are My Optimized Prompts Compromised? Exploring Vulnerabilities of LLM-based Optimizers EACL 2026

Learning Multilingual Agentic Policy to Control Sycophancy EACL 2026

ToxiPrompt: A Two-Stage Red-Teaming Approach for Balancing Adversarial Prompt Diversity and Response Toxicity EACL 2026

When Can We Trust LLMs in Mental Health? Large-Scale Benchmarks for Reliable LLM Evaluation EACL 2026

Zer0-Jack: A memory-efficient gradient-based jailbreaking method for black box Multi-modal Large Language Models EACL 2026

BAFLE-DCT: Bypassing Adversarial Filters via Frequency-Selective Embedding in the DCT Domain WACV 2026

UltraClean: A Simple Framework to Train Robust Neural Networks against Backdoor Attacks WACV 2026

Teams of LLM Agents can Exploit Zero-Day Vulnerabilities EACL 2026

Understanding Jailbreak Success: A Study of Latent Space Dynamics in Large Language Models EACL 2026

Attacker’s Noise Can Manipulate Your Audio-based LLM in the Real World EACL 2026

CRADLE Bench: A Clinician-Annotated Benchmark for Multi-Faceted Mental Health Crisis and Safety Risk Detection EACL 2026

Unraveling LLM Jailbreaks Through Safety Knowledge Neurons EACL 2026

Layer-wise Swapping for Generalizable Multilingual Safety EACL 2026

Safe-Unsafe Concept Separation Emerges from a Single Direction in Language Models Activation Space EACL 2026

Breach in the Shield: Unveiling the Vulnerabilities of Large Language Models EACL 2026

From Delegates to Trustees: How Optimizing for Long-Term Interests Shapes Bias and Alignment in LLMs EACL 2026