AI Safety

2972 directly classified papers

Papers

Permitted Knowledge Boundary: Evaluating the Knowledge-Constrained Responsiveness of Large Language Models EMNLP 2025

FC-Attack: Jailbreaking Multimodal Large Language Models via Auto-Generated Flowcharts EMNLP 2025

SafeInt: Shielding Large Language Models from Jailbreak Attacks via Safety-Aware Representation Intervention EMNLP 2025

Improving Alignment in LVLMs with Debiased Self-Judgment EMNLP 2025

Distributional Surgery for Language Model Activations EMNLP 2025

Hidden Ghost Hand: Unveiling Backdoor Vulnerabilities in MLLM-Powered Mobile GUI Agents EMNLP 2025

Towards Reverse Engineering of Language Models: A Survey EMNLP 2025

Invisible Prompts, Visible Threats: Malicious Font Injection in External Resources for Large Language Models EMNLP 2025

SafeSwitch: Steering Unsafe LLM Behavior via Internal Activation Signals EMNLP 2025

Is Safety Standard Same for Everyone? User-Specific Safety Evaluation of Large Language Models EMNLP 2025

A Knapsack by Any Other Name: Presentation impacts LLM performance on NP-hard problems EMNLP 2025

LLM Jailbreak Detection for (Almost) Free! EMNLP 2025

X-Boundary: Establishing Exact Safety Boundary to Shield LLMs from Jailbreak Attacks without Compromising Usability EMNLP 2025

Forewarned is Forearmed: Pre-Synthesizing Jailbreak-like Instructions to Enhance LLM Safety Guardrail to Potential Attacks EMNLP 2025

Breaking the Reviewer: Assessing the Vulnerability of Large Language Models in Automated Peer Review Under Textual Adversarial Attacks EMNLP 2025

How Does Cognitive Bias Affect Large Language Models? A Case Study on the Anchoring Effect in Price Negotiation Simulations EMNLP 2025

PREE: Towards Harmless and Adaptive Fingerprint Editing in Large Language Models via Knowledge Prefix Enhancement EMNLP 2025

Multilingual Collaborative Defense for Large Language Models EMNLP 2025

PD3F: A Pluggable and Dynamic DoS-Defense Framework against resource consumption attacks targeting Large Language Models EMNLP 2025

LNE-Blocking: An Efficient Framework for Contamination Mitigation Evaluation on Large Language Models EMNLP 2025

SafeConf: A Confidence-Calibrated Safety Self-Evaluation Method for Large Language Models EMNLP 2025

Not All Voices Are Rewarded Equally: Probing and Repairing Reward Models across Human Diversity EMNLP 2025

Detecting Stealthy Backdoor Samples based on Intra-class Distance for Large Language Models EMNLP 2025

PromptKeeper: Safeguarding System Prompts for LLMs EMNLP 2025

ASTPrompter: Preference-Aligned Automated Language Model Red-Teaming to Generate Low-Perplexity Unsafe Prompts EMNLP 2025