conftrace_

Safety

414 papers

Papers per year (chart values, in chart order): 1, 1, 4, 8, 11, 21, 29, 36, 87, 117, 99

Papers

Exploring the Impact of Instruction-Tuning on LLM’s Susceptibility to Misinformation ACL 2025

LLMs Caught in the Crossfire: Malware Requests and Jailbreak Challenges ACL 2025

PIGuard: Prompt Injection Guardrail via Mitigating Overdefense for Free ACL 2025

LongSafety: Evaluating Long-Context Safety of Large Language Models ACL 2025

Unintended Harms of Value-Aligned LLMs: Psychological and Empirical Insights ACL 2025

PKU-SafeRLHF: Towards Multi-Level Safety Alignment for LLMs with Human Preference ACL 2025

Combining Domain and Alignment Vectors Provides Better Knowledge-Safety Trade-offs in LLMs ACL 2025

NLP for Counterspeech against Hate and Misinformation (CSHAM) ACL 2025

Guardrails and Security for LLMs: Safe, Secure and Controllable Steering of LLM Applications ACL 2025

Defensive Prompt Patch: A Robust and Generalizable Defense of Large Language Models against Jailbreak Attacks ACL 2025

SafeRoute: Adaptive Model Selection for Efficient and Accurate Safety Guardrails in Large Language Models ACL 2025

VSCBench: Bridging the Gap in Vision-Language Model Safety Calibration ACL 2025

Unraveling and Mitigating Safety Alignment Degradation of Vision-Language Models ACL 2025

Rewrite to Jailbreak: Discover Learnable and Transferable Implicit Harmfulness Instruction ACL 2025

QueryAttack: Jailbreaking Aligned Large Language Models Using Structured Non-natural Query Language ACL 2025

Tell Me What You Don’t Know: Enhancing Refusal Capabilities of Role-Playing Agents via Representation Space Analysis and Editing ACL 2025

Why Not Act on What You Know? Unleashing Safety Potential of LLMs via Self-Aware Guard Enhancement ACL 2025

Error Detection in Medical Note through Multi Agent Debate ACL 2025

PL-Guard: Benchmarking Language Model Safety for Polish ACL 2025

Fine-Tuning Lowers Safety and Disrupts Evaluation Consistency ACL 2025

Preventing Rogue Agents Improves Multi-Agent Collaboration ACL 2025

Can LLMs Rank the Harmfulness of Smaller LLMs? We are Not There Yet ACL 2025

QGuard: Question-based Zero-shot Guard for Multi-modal LLM Safety ACL 2025

Tightening Robustness Verification of MaxPool-based Neural Networks via Minimizing the Over-Approximation Zone CVPR 2025

EntropyMark: Towards More Harmless Backdoor Watermark via Entropy-based Constraint for Open-source Dataset Copyright Protection CVPR 2025