AI Safety

2972 directly classified papers

Papers

Why Safeguarded Ships Run Aground? Aligned Large Language Models’ Safety Mechanisms Tend to Be Anchored in The Template Region ACL 2025

Cheems: A Practical Guidance for Building and Evaluating Chinese Reward Models from Scratch ACL 2025

Towards a Theory of AI Personhood AAAI 2025

CALM: Curiosity-Driven Auditing for Large Language Models AAAI 2025

Certified Trustworthiness in the Era of Large Language Models AAAI 2025

Combating Phone Scams with LLM-based Detection: Where Do We Stand? (Student Abstract) AAAI 2025

Fine-Tuning Lowers Safety and Disrupts Evaluation Consistency ACL 2025

TRIDENT: Enhancing Large Language Model Safety with Tri-Dimensional Diversified Red-Teaming Data Synthesis ACL 2025

1-2-3 Check: Enhancing Contextual Privacy in LLM via Multi-Agent Reasoning ACL 2025

X-Guard: Multilingual Guard Agent for Content Moderation ACL 2025

Chinese SafetyQA: A Safety Short-form Factuality Benchmark for Large Language Models ACL 2025

Data with High and Consistent Preference Difference Are Better for Reward Model AAAI 2025

ArithmAttack: Evaluating Robustness of LLMs to Noisy Context in Math Problem Solving ACL 2025

Guardians of Trust: Risks and Opportunities for LLMs in Mental Health ACL 2025

What Counts Underlying LLMs’ Moral Dilemma Judgments? ACL 2025

Inherent and emergent liability issues in LLM-based agentic systems: a principal-agent perspective ACL 2025

Safe in Isolation, Dangerous Together: Agent-Driven Multi-Turn Decomposition Jailbreaks on LLMs ACL 2025

iShumei-Chinchunmei at SemEval-2025 Task 4: A balanced forgetting and retention multi-task framework using effective unlearning loss ACL 2025

AILS-NTUA at SemEval-2025 Task 4: Parameter-Efficient Unlearning for Large Language Models using Data Chunking ACL 2025

SemEval-2025 Task 3: Mu-SHROOM, the Multilingual Shared-task on Hallucinations and Related Observable Overgeneration Mistakes ACL 2025

SemEval-2025 Task 4: Unlearning sensitive content from Large Language Models ACL 2025

PROTECT: Policy-Related Organizational Value Taxonomy for Ethical Compliance and Trust ACL 2025

Can LLMs Rank the Harmfulness of Smaller LLMs? We are Not There Yet ACL 2025

QGuard: Question-based Zero-shot Guard for Multi-modal LLM Safety ACL 2025

LongSafety: Enhance Safety for Long-Context LLMs ACL 2025