conftrace
Safety
414 papers
Papers per year
2016: 1
2017: 1
2018: 4
2019: 8
2020: 11
2021: 21
2022: 29
2023: 36
2024: 87
2025: 117
2026: 99
Papers
Root Defense Strategies: Ensuring Safety of LLM at the Decoding Level
ACL 2025
What Really Matters in Many-Shot Attacks? An Empirical Study of Long-Context Vulnerabilities in LLMs
ACL 2025
Contrastive Perplexity for Controlled Generation: An Application in Detoxifying Large Language Models
ACL 2025
Refuse Whenever You Feel Unsafe: Improving Safety in LLMs via Decoupled Refusal Training
ACL 2025
SafeRAG: Benchmarking Security in Retrieval-Augmented Generation of Large Language Model
ACL 2025
Multimodal Pragmatic Jailbreak on Text-to-image Models
ACL 2025
Improve Safety Training of Large Language Models with Safety-Critical Singular Vectors Localization
ACL 2025
Stepwise Reasoning Disruption Attack of LLMs
ACL 2025
Efficient Universal Goal Hijacking with Semantics-guided Prompt Organization
ACL 2025
The TIP of the Iceberg: Revealing a Hidden Class of Task-in-Prompt Adversarial Attacks on LLMs
ACL 2025
AGrail: A Lifelong Agent Guardrail with Effective and Adaptive Safety Detection
ACL 2025
VLSBench: Unveiling Visual Leakage in Multimodal Safety
ACL 2025
Beware of Your Po! Measuring and Mitigating AI Safety Risks in Role-Play Fine-Tuning of LLMs
ACL 2025
Jailbreaking? One Step Is Enough!
ACL 2025
HiddenDetect: Detecting Jailbreak Attacks against Multimodal Large Language Models via Monitoring Hidden States
ACL 2025
Chinese SafetyQA: A Safety Short-form Factuality Benchmark for Large Language Models
ACL 2025
TRIDENT: Enhancing Large Language Model Safety with Tri-Dimensional Diversified Red-Teaming Data Synthesis
ACL 2025
Sheep’s Skin, Wolf’s Deeds: Are LLMs Ready for Metaphorical Implicit Hate Speech?
ACL 2025
Can’t See the Forest for the Trees: Benchmarking Multimodal Safety Awareness for Multimodal LLMs
ACL 2025
Mis-prompt: Benchmarking Large Language Models for Proactive Error Handling
ACL 2025
AGD: Adversarial Game Defense Against Jailbreak Attacks in Large Language Models
ACL 2025
Defense Against Prompt Injection Attack by Leveraging Attack Techniques
ACL 2025
LED-Merging: Mitigating Safety-Utility Conflicts in Model Merging with Location-Election-Disjoint
ACL 2025
Silencing Empowerment, Allowing Bigotry: Auditing the Moderation of Hate Speech on Twitch
ACL 2025
MPO: Multilingual Safety Alignment via Reward Gap Optimization
ACL 2025