Safety

414 papers

Papers per year (year labels not preserved): 1, 1, 4, 8, 11, 21, 29, 36, 87, 117, 99

Papers

Root Defense Strategies: Ensuring Safety of LLM at the Decoding Level ACL 2025

What Really Matters in Many-Shot Attacks? An Empirical Study of Long-Context Vulnerabilities in LLMs ACL 2025

Contrastive Perplexity for Controlled Generation: An Application in Detoxifying Large Language Models ACL 2025

Refuse Whenever You Feel Unsafe: Improving Safety in LLMs via Decoupled Refusal Training ACL 2025

SafeRAG: Benchmarking Security in Retrieval-Augmented Generation of Large Language Model ACL 2025

Multimodal Pragmatic Jailbreak on Text-to-image Models ACL 2025

Improve Safety Training of Large Language Models with Safety-Critical Singular Vectors Localization ACL 2025

Stepwise Reasoning Disruption Attack of LLMs ACL 2025

Efficient Universal Goal Hijacking with Semantics-guided Prompt Organization ACL 2025

The TIP of the Iceberg: Revealing a Hidden Class of Task-in-Prompt Adversarial Attacks on LLMs ACL 2025

AGrail: A Lifelong Agent Guardrail with Effective and Adaptive Safety Detection ACL 2025

VLSBench: Unveiling Visual Leakage in Multimodal Safety ACL 2025

Beware of Your Po! Measuring and Mitigating AI Safety Risks in Role-Play Fine-Tuning of LLMs ACL 2025

Jailbreaking? One Step Is Enough! ACL 2025

HiddenDetect: Detecting Jailbreak Attacks against Multimodal Large Language Models via Monitoring Hidden States ACL 2025

Chinese SafetyQA: A Safety Short-form Factuality Benchmark for Large Language Models ACL 2025

TRIDENT: Enhancing Large Language Model Safety with Tri-Dimensional Diversified Red-Teaming Data Synthesis ACL 2025

Sheep’s Skin, Wolf’s Deeds: Are LLMs Ready for Metaphorical Implicit Hate Speech? ACL 2025

Can’t See the Forest for the Trees: Benchmarking Multimodal Safety Awareness for Multimodal LLMs ACL 2025

Mis-prompt: Benchmarking Large Language Models for Proactive Error Handling ACL 2025

AGD: Adversarial Game Defense Against Jailbreak Attacks in Large Language Models ACL 2025

Defense Against Prompt Injection Attack by Leveraging Attack Techniques ACL 2025

LED-Merging: Mitigating Safety-Utility Conflicts in Model Merging with Location-Election-Disjoint ACL 2025

Silencing Empowerment, Allowing Bigotry: Auditing the Moderation of Hate Speech on Twitch ACL 2025

MPO: Multilingual Safety Alignment via Reward Gap Optimization ACL 2025