AI Safety

2972 directly classified papers

Papers

Hermit Kingdom Through the Lens of Multiple Perspectives: A Case Study of LLM Hallucination on North Korea COLING 2025

Jailbreak LLMs through Internal Stance Manipulation EMNLP 2025

CONTRANS: Weak-to-Strong Alignment Engineering via Concept Transplantation COLING 2025

Exploiting the Index Gradients for Optimization-Based Jailbreaking on Large Language Models COLING 2025

What’s the most important value? INVP: INvestigating the Value Priorities of LLMs through Decision-making in Social Scenarios COLING 2025

PricingLogic: Evaluating LLMs Reasoning on Complex Tourism Pricing Tasks EMNLP 2025

MM-CRITIC: A Holistic Evaluation of Large Multimodal Models as Multimodal Critique EMNLP 2025

HAF-RM: A Hybrid Alignment Framework for Reward Model Training ACL 2025

Towards Statistical Factuality Guarantee for Large Vision-Language Models EMNLP 2025

FinTrust: A Comprehensive Benchmark of Trustworthiness Evaluation in Finance Domain EMNLP 2025

CoBia: Constructed Conversations Can Trigger Otherwise Concealed Societal Biases in LLMs EMNLP 2025

SConU: Selective Conformal Uncertainty in Large Language Models ACL 2025

SafeScientist: Enhancing AI Scientist Safety for Risk-Aware Scientific Discovery EMNLP 2025

Large Language Models Have Intrinsic Meta-Cognition, but Need a Good Lens EMNLP 2025

Stand on The Shoulders of Giants: Building JailExpert from Previous Attack Experience EMNLP 2025

SeqAR: Jailbreak LLMs with Sequential Auto-Generated Characters NAACL 2025

Decoding Hate: Exploring Language Models’ Reactions to Hate Speech NAACL 2025

SafetyQuizzer: Timely and Dynamic Evaluation on the Safety of LLMs NAACL 2025

Stealthy Jailbreak Attacks on Large Language Models via Benign Data Mirroring NAACL 2025

From Allies to Adversaries: Manipulating LLM Tool-Calling through Adversarial Injection NAACL 2025

Have LLMs Reopened the Pandora’s Box of AI-Generated Fake News? NAACL 2025

Extracting and Understanding the Superficial Knowledge in Alignment NAACL 2025

Agentic Reward Modeling: Integrating Human Preferences with Verifiable Correctness Signals for Reliable Reward Systems ACL 2025

Dynamic Evaluation with Cognitive Reasoning for Multi-turn Safety of Large Language Models ACL 2025

HalLoc: Token-level Localization of Hallucinations for Vision Language Models CVPR 2025