AI Safety

2972 directly classified papers

Papers

LSSF: Safety Alignment for Large Language Models through Low-Rank Safety Subspace Fusion ACL 2025

LongSafety: Evaluating Long-Context Safety of Large Language Models ACL 2025

Unintended Harms of Value-Aligned LLMs: Psychological and Empirical Insights ACL 2025

Adversarial Preference Learning for Robust LLM Alignment ACL 2025

Are the Values of LLMs Structurally Aligned with Humans? A Causal Perspective ACL 2025

Extended Abstract: Probing-Guided Parameter-Efficient Fine-Tuning for Balancing Linguistic Adaptation and Safety in LLM-based Social Influence Systems ACL 2025

Detoxify-IT: An Italian Parallel Dataset for Text Detoxification ACL 2025

Multilingual Text-to-Image Generation Magnifies Gender Stereotypes ACL 2025

Caution for the Environment: Multimodal LLM Agents are Susceptible to Environmental Distractions ACL 2025

Can GPTZero’s AI Vocabulary Distinguish Between LLM-Generated and Student-Written Essays? ACL 2025

Efficient but Vulnerable: Benchmarking and Defending LLM Batch Prompting Attack ACL 2025

Unmasking Style Sensitivity: A Causal Analysis of Bias Evaluation Instability in Large Language Models ACL 2025

SafeChain: Safety of Language Models with Long Chain-of-Thought Reasoning Capabilities ACL 2025

Understanding the Dark Side of LLMs’ Intrinsic Self-Correction ACL 2025

Beyond Reactive Safety: Risk-Aware LLM Alignment via Long-Horizon Simulation ACL 2025

Defending against Indirect Prompt Injection by Instruction Detection EMNLP 2025

Localizing Malicious Outputs from CodeLLM EMNLP 2025

MidPO: Dual Preference Optimization for Safety and Helpfulness in Large Language Models via a Mixture of Experts Framework EMNLP 2025

TrapDoc: Deceiving LLM Users by Injecting Imperceptible Phantom Tokens into Documents EMNLP 2025

One Shot Dominance: Knowledge Poisoning Attack on Retrieval-Augmented Generation Systems EMNLP 2025

Can Large Language Models Identify Implicit Suicidal Ideation? An Empirical Evaluation EMNLP 2025

SafeToolBench: Pioneering a Prospective Benchmark to Evaluating Tool Utilization Safety in LLMs EMNLP 2025

Beyond Surface Alignment: Rebuilding LLMs Safety Mechanism via Probabilistically Ablating Refusal Direction EMNLP 2025

Curse of Knowledge: Your Guidance and Provided Knowledge are biasing LLM Judges in Complex Evaluation EMNLP 2025

From Remembering to Metacognition: Do Existing Benchmarks Accurately Evaluate LLMs? EMNLP 2025