conftrace
Safety
414 papers
Papers per year
2016: 1
2017: 1
2018: 4
2019: 8
2020: 11
2021: 21
2022: 29
2023: 36
2024: 87
2025: 117
2026: 99
Papers
Exploring the Impact of Instruction-Tuning on LLM’s Susceptibility to Misinformation
ACL 2025
LLMs Caught in the Crossfire: Malware Requests and Jailbreak Challenges
ACL 2025
PIGuard: Prompt Injection Guardrail via Mitigating Overdefense for Free
ACL 2025
LongSafety: Evaluating Long-Context Safety of Large Language Models
ACL 2025
Unintended Harms of Value-Aligned LLMs: Psychological and Empirical Insights
ACL 2025
PKU-SafeRLHF: Towards Multi-Level Safety Alignment for LLMs with Human Preference
ACL 2025
Combining Domain and Alignment Vectors Provides Better Knowledge-Safety Trade-offs in LLMs
ACL 2025
NLP for Counterspeech against Hate and Misinformation (CSHAM)
ACL 2025
Guardrails and Security for LLMs: Safe, Secure and Controllable Steering of LLM Applications
ACL 2025
Defensive Prompt Patch: A Robust and Generalizable Defense of Large Language Models against Jailbreak Attacks
ACL 2025
SafeRoute: Adaptive Model Selection for Efficient and Accurate Safety Guardrails in Large Language Models
ACL 2025
VSCBench: Bridging the Gap in Vision-Language Model Safety Calibration
ACL 2025
Unraveling and Mitigating Safety Alignment Degradation of Vision-Language Models
ACL 2025
Rewrite to Jailbreak: Discover Learnable and Transferable Implicit Harmfulness Instruction
ACL 2025
QueryAttack: Jailbreaking Aligned Large Language Models Using Structured Non-natural Query Language
ACL 2025
Tell Me What You Don’t Know: Enhancing Refusal Capabilities of Role-Playing Agents via Representation Space Analysis and Editing
ACL 2025
Why Not Act on What You Know? Unleashing Safety Potential of LLMs via Self-Aware Guard Enhancement
ACL 2025
Error Detection in Medical Note through Multi Agent Debate
ACL 2025
PL-Guard: Benchmarking Language Model Safety for Polish
ACL 2025
Fine-Tuning Lowers Safety and Disrupts Evaluation Consistency
ACL 2025
Preventing Rogue Agents Improves Multi-Agent Collaboration
ACL 2025
Can LLMs Rank the Harmfulness of Smaller LLMs? We are Not There Yet
ACL 2025
QGuard: Question-based Zero-shot Guard for Multi-modal LLM Safety
ACL 2025
Tightening Robustness Verification of MaxPool-based Neural Networks via Minimizing the Over-Approximation Zone
CVPR 2025
EntropyMark: Towards More Harmless Backdoor Watermark via Entropy-based Constraint for Open-source Dataset Copyright Protection
CVPR 2025