AI Safety

2972 directly classified papers

Papers

Token-Aware Editing of Internal Activations for Large Language Model Alignment EMNLP 2025

Reimagining Safety Alignment with An Image EMNLP 2025

Visual Contextual Attack: Jailbreaking MLLMs with Image-Driven Context Injection EMNLP 2025

Reshaping Representation Space to Balance the Safety and Over-rejection in Large Audio Language Models EMNLP 2025

Towards Holistic Evaluation of Large Audio-Language Models: A Comprehensive Survey EMNLP 2025

Learn and Unlearn: Addressing Misinformation in Multilingual LLMs EMNLP 2025

SUA: Stealthy Multimodal Large Language Model Unlearning Attack EMNLP 2025

Meta-Rewarding Language Models: Self-Improving Alignment with LLM-as-a-Meta-Judge EMNLP 2025

EmoAgent: Assessing and Safeguarding Human-AI Interaction for Mental Health Safety EMNLP 2025

Can an Individual Manipulate the Collective Decisions of Multi-Agents? EMNLP 2025

Data to Defense: The Role of Curation in Aligning Large Language Models Against Safety Compromise EMNLP 2025

Speculative Safety-Aware Decoding EMNLP 2025

Advancing Oversight Reasoning across Languages for Audit Sycophantic Behaviour via X-Agent EMNLP 2025

Sycophancy Mitigation Through Reinforcement Learning with Uncertainty-Aware Adaptive Reasoning Trajectories EMNLP 2025

SimVBG: Simulating Individual Values by Backstory Generation EMNLP 2025

The Impact of Negated Text on Hallucination with Large Language Models EMNLP 2025

DiplomacyAgent: Do LLMs Balance Interests and Ethical Principles in International Events? EMNLP 2025

Unmasking Deceptive Visuals: Benchmarking Multimodal Large Language Models on Misleading Chart Question Answering EMNLP 2025

Improve LLM-as-a-Judge Ability as a General Ability EMNLP 2025

ToolSafety: A Comprehensive Dataset for Enhancing Safety in LLM-Based Agent Tool Invocations EMNLP 2025

SPIRIT: Patching Speech Language Models against Jailbreak Attacks EMNLP 2025

Reward Model Perspectives: Whose Opinions Do Reward Models Reward? EMNLP 2025

DyePack: Provably Flagging Test Set Contamination in LLMs Using Backdoors EMNLP 2025

Jailbreak LLMs through Internal Stance Manipulation EMNLP 2025

Pierce the Mists, Greet the Sky: Decipher Knowledge Overshadowing via Knowledge Circuit Analysis EMNLP 2025