AI Safety

2972 directly classified papers

Papers

Learning to Rewrite: Generalized LLM-Generated Text Detection ACL 2025

Biased LLMs can Influence Political Decision-Making ACL 2025

LLM as a Broken Telephone: Iterative Generation Distorts Information ACL 2025

AGrail: A Lifelong Agent Guardrail with Effective and Adaptive Safety Detection ACL 2025

VLSBench: Unveiling Visual Leakage in Multimodal Safety ACL 2025

Exploiting the Shadows: Unveiling Privacy Leaks through Lower-Ranked Tokens in Large Language Models ACL 2025

PIG: Privacy Jailbreak Attack on LLMs via Gradient-based Iterative In-Context Optimization ACL 2025

InvestAlign: Overcoming Data Scarcity in Aligning Large Language Models with Investor Decision-Making Processes Under Herd Behavior ACL 2025

PrivacyRestore: Privacy-Preserving Inference in Large Language Models via Privacy Removal and Restoration ACL 2025

Beware of Your Po! Measuring and Mitigating AI Safety Risks in Role-Play Fine-Tuning of LLMs ACL 2025

Improving Factuality with Explicit Working Memory ACL 2025

GuideBench: Benchmarking Domain-Oriented Guideline Following for LLM Agents ACL 2025

Jailbreaking? One Step Is Enough! ACL 2025

Generative Psycho-Lexical Approach for Constructing Value Systems in Large Language Models ACL 2025

HiddenDetect: Detecting Jailbreak Attacks against Multimodal Large Language Models via Monitoring Hidden States ACL 2025

Can LLMs Ground when they (Don’t) Know: A Study on Direct and Loaded Political Questions ACL 2025

Chinese SafetyQA: A Safety Short-form Factuality Benchmark for Large Language Models ACL 2025

TRIDENT: Enhancing Large Language Model Safety with Tri-Dimensional Diversified Red-Teaming Data Synthesis ACL 2025

Cheems: A Practical Guidance for Building and Evaluating Chinese Reward Models from Scratch ACL 2025

Why Safeguarded Ships Run Aground? Aligned Large Language Models’ Safety Mechanisms Tend to Be Anchored in The Template Region ACL 2025

How to Mitigate Overfitting in Weak-to-strong Generalization? ACL 2025

M2S: Multi-turn to Single-turn jailbreak in Red Teaming for LLMs ACL 2025

Sheep’s Skin, Wolf’s Deeds: Are LLMs Ready for Metaphorical Implicit Hate Speech? ACL 2025

Do not Abstain! Identify and Solve the Uncertainty ACL 2025

Can You Really Trust Code Copilot? Evaluating Large Language Models from a Code Security Perspective ACL 2025