AI Safety

2972 directly classified papers

Papers

Advancing Oversight Reasoning across Languages for Audit Sycophantic Behaviour via X-Agent EMNLP 2025

Adversary-Aware DPO: Enhancing Safety Alignment in Vision Language Models via Adversarial Training EMNLP 2025

Speculative Safety-Aware Decoding EMNLP 2025

Can You Trick the Grader? Adversarial Persuasion of LLM Judges EMNLP 2025

The Confidence Paradox: Can LLM Know When It’s Wrong? IJCNLP 2025

Neutral Is Not Unbiased: Evaluating Implicit and Intersectional Identity Bias in LLMs Through Structured Narrative Scenarios EMNLP 2025

Towards Better Value Principles for Large Language Model Alignment: A Systematic Evaluation and Enhancement ACL 2025

Data to Defense: The Role of Curation in Aligning Large Language Models Against Safety Compromise EMNLP 2025

Can an Individual Manipulate the Collective Decisions of Multi-Agents? EMNLP 2025

On Guardrail Models’ Robustness to Mutations and Adversarial Attacks EMNLP 2025

MTSA: Multi-turn Safety Alignment for LLMs through Multi-round Red-teaming ACL 2025

Reasoning-Augmented Conversation for Multi-Turn Jailbreak Attacks on Large Language Models EMNLP 2025

EmoAgent: Assessing and Safeguarding Human-AI Interaction for Mental Health Safety EMNLP 2025

Beneath the Facade: Probing Safety Vulnerabilities in LLMs via Auto-Generated Jailbreak Prompts EMNLP 2025

Meta-Rewarding Language Models: Self-Improving Alignment with LLM-as-a-Meta-Judge EMNLP 2025

BHRAM-IL: A Benchmark for Hallucination Recognition and Assessment in Multiple Indian Languages IJCNLP 2025

Universal Acoustic Adversarial Attacks for Flexible Control of Speech-LLMs EMNLP 2025

Scalable Surrogate Verification of Image-Based Neural Network Control Systems Using Composition and Unrolling AAAI 2025

SudoLM: Learning Access Control of Parametric Knowledge with Authorization Alignment ACL 2025

Assessing Reliability and Political Bias In LLMs’ Judgements of Formal and Material Inferences With Partisan Conclusions ACL 2025

PricingLogic: Evaluating LLMs Reasoning on Complex Tourism Pricing Tasks EMNLP 2025

FinTrust: A Comprehensive Benchmark of Trustworthiness Evaluation in Finance Domain EMNLP 2025

SUA: Stealthy Multimodal Large Language Model Unlearning Attack EMNLP 2025

Learn and Unlearn: Addressing Misinformation in Multilingual LLMs EMNLP 2025

Challenges and Remedies of Domain-Specific Classifiers as LLM Guardrails: Self-Harm as a Case Study NAACL 2025