AI Safety

2972 directly classified papers

Papers

Beyond Guardrails: Advanced Safety for Large Language Models — Monolingual, Multilingual and Multimodal Frontiers IJCNLP 2025

Algorithmic Fidelity of Large Language Models in Generating Synthetic German Public Opinions: A Case Study ACL 2025

Online Learning Defense against Iterative Jailbreak Attacks via Prompt Optimization IJCNLP 2025

Role-Aware Language Models for Secure and Contextualized Access Control in Organizations IJCNLP 2025

On the Convergence of Moral Self-Correction in Large Language Models IJCNLP 2025

Hidden in Plain Text: Emergence & Mitigation of Steganographic Collusion in LLMs IJCNLP 2025

Towards a Theory of AI Personhood AAAI 2025

A Survey on LLM-Assisted Clinical Trial Recruitment IJCNLP 2025

Human-in-the-Loop Generation of Adversarial Texts: A Case Study on Tibetan Script IJCNLP 2025

Information-theoretic Distinctions Between Deception and Confusion IJCNLP 2025

Counterfactual Evaluation for Blind Attack Detection in LLM-based Evaluation Systems IJCNLP 2025

Attention Eclipse: Manipulating Attention to Bypass LLM Safety-Alignment EMNLP 2025

When Truthful Representations Flip Under Deceptive Instructions? EMNLP 2025

Swushroomsia at SemEval-2025 Task 3: Probing LLMs’ Collective Intelligence for Multilingual Hallucination Detection SEMEVAL 2025

UCSC at SemEval-2025 Task 3: Context, Models and Prompt Optimization for Automated Hallucination Detection in LLM Output SEMEVAL 2025

SABER: Uncovering Vulnerabilities in Safety Alignment via Cross-Layer Residual Connection EMNLP 2025

MULTIGUARD: An Efficient Approach for AI Safety Moderation Across Languages and Modalities EMNLP 2025

HausaNLP at SemEval-2025 Task 3: Towards a Fine-Grained Model-Aware Hallucination Detection SEMEVAL 2025

Governance in Motion: Co-evolution of Constitutions and AI models for Scalable Safety EMNLP 2025

Phi: Preference Hijacking in Multi-modal Large Language Models at Inference Time EMNLP 2025

Improbable Bigrams Expose Vulnerabilities of Incomplete Tokens in Byte-Level Tokenizers EMNLP 2025

TrojanWave: Exploiting Prompt Learning for Stealthy Backdoor Attacks on Large Audio-Language Models EMNLP 2025

Mitigating Hallucinations in LM-Based TTS Models via Distribution Alignment Using GFlowNets EMNLP 2025

Path Drift in Large Reasoning Models: How First-Person Commitments Override Safety EMNLP 2025

Pierce the Mists, Greet the Sky: Decipher Knowledge Overshadowing via Knowledge Circuit Analysis EMNLP 2025