AI Safety

2972 directly classified papers

Papers

Unmasking Style Sensitivity: A Causal Analysis of Bias Evaluation Instability in Large Language Models ACL 2025

Jailbreak Attack Initializations as Extractors of Compliance Directions EMNLP 2025

SafeChain: Safety of Language Models with Long Chain-of-Thought Reasoning Capabilities ACL 2025

CLARITY: Clinical Assistant for Routing, Inference, and Triage EMNLP 2025

Understanding the Dark Side of LLMs’ Intrinsic Self-Correction ACL 2025

Layer-Aware Representation Filtering: Purifying Finetuning Data to Preserve LLM Safety Alignment EMNLP 2025

Beyond Reactive Safety: Risk-Aware LLM Alignment via Long-Horizon Simulation ACL 2025

Scalable Surrogate Verification of Image-Based Neural Network Control Systems Using Composition and Unrolling AAAI 2025

Contrasting Adversarial Perturbations: The Space of Harmless Perturbations AAAI 2025

GIM: A Million-scale Benchmark for Generative Image Manipulation Detection and Localization AAAI 2025

LongSafety: Enhance Safety for Long-Context LLMs ACL 2025

Can LLMs Ground when they (Don’t) Know: A Study on Direct and Loaded Political Questions ACL 2025

Using Humor to Bypass Safety Guardrails in Large Language Models ACL 2025

RedHit: Adaptive Red-Teaming of Large Language Models via Search, Reasoning, and Preference Optimization ACL 2025

HiddenDetect: Detecting Jailbreak Attacks against Multimodal Large Language Models via Monitoring Hidden States ACL 2025

CRAFT: Class Ranking Aware Fine-Tuning for Enhanced Out-of-Distribution Detection WACV 2025

The Dark Side of Function Calling: Pathways to Jailbreaking Large Language Models COLING 2025

UTF: Under-trained Tokens as Fingerprints —— a Novel Approach to LLM Identification ACL 2025

Superfluous Instruction: Vulnerabilities Stemming from Task-Specific Superficial Expressions in Instruction Templates ACL 2025

Generative Psycho-Lexical Approach for Constructing Value Systems in Large Language Models ACL 2025

Can LLMs Recognize Their Own Analogical Hallucinations? Evaluating Uncertainty Estimation for Analogical Reasoning ACL 2025

Teaching Large Language Models to Express Knowledge Boundary from Their Own Signals ACL 2025

Jailbreaking? One Step Is Enough! ACL 2025

Pretend Benign: A Stealthy Adversarial Attack by Exploiting Vulnerabilities in Cooperative Perception ICCV 2025

ELAB: Extensive LLM Alignment Benchmark in Persian Language ACL 2025