AI Safety

2972 directly classified papers

Papers

Counterfactual Evaluation for Blind Attack Detection in LLM-based Evaluation Systems IJCNLP 2025

Hidden in Plain Text: Emergence & Mitigation of Steganographic Collusion in LLMs IJCNLP 2025

On the Convergence of Moral Self-Correction in Large Language Models IJCNLP 2025

Online Learning Defense against Iterative Jailbreak Attacks via Prompt Optimization IJCNLP 2025

Transferable Direct Prompt Injection via Activation-Guided MCMC Sampling EMNLP 2025

Foot-In-The-Door: A Multi-turn Jailbreak for LLMs EMNLP 2025

Aurora-M: Open Source Continual Pre-training for Multilingual Language and Code COLING 2025

Lightweight Safety Guardrails Using Fine-tuned BERT Embeddings COLING 2025

Towards Truly Open, Language-Specific, Safe, Factual, and Specialized Large Language Models COLING 2025

Chat Bankman-Fried: an Exploration of LLM Alignment in Finance COLING 2025

SilverSpeak: Evading AI-Generated Text Detectors using Homoglyphs COLING 2025

Human vs. AI: A Novel Benchmark and a Comparative Study on the Detection of Generated Images and the Impact of Prompts COLING 2025

Mirror Minds: An Empirical Study on Detecting LLM-Generated Text via LLMs COLING 2025

CoBia: Constructed Conversations Can Trigger Otherwise Concealed Societal Biases in LLMs EMNLP 2025

Persuasion Dynamics in LLMs: Investigating Robustness and Adaptability in Knowledge and Safety with DuET-PD EMNLP 2025

IPIGuard: A Novel Tool Dependency Graph-Based Defense Against Indirect Prompt Injection in LLM Agents EMNLP 2025

Refusal-Aware Red Teaming: Exposing Inconsistency in Safety Evaluations EMNLP 2025

Context Reasoner: Incentivizing Reasoning Capability for Contextualized Privacy and Safety Compliance via Reinforcement Learning EMNLP 2025

EMNLP: Educator-role Moral and Normative Large Language Models Profiling EMNLP 2025

Atoxia: Red-teaming Large Language Models with Target Toxic Answers NAACL 2025

Challenges in Trustworthy Human Evaluation of Chatbots NAACL 2025

Multilingual Blending: Large Language Model Safety Alignment Evaluation with Language Mixture NAACL 2025

Safe Inputs but Unsafe Output: Benchmarking Cross-modality Safety Alignment of Large Vision-Language Models NAACL 2025

Large Language Models are Easily Confused: A Quantitative Metric, Security Implications and Typological Analysis NAACL 2025

Vulnerability of Large Language Models to Output Prefix Jailbreaks: Impact of Positions on Safety NAACL 2025