conftrace_


Safety

414 papers

Papers per year (bar chart; year labels not recovered): 1, 1, 4, 8, 11, 21, 29, 36, 87, 117, 99

Papers

Subtle Risks, Critical Failures: A Framework for Diagnosing Physical Safety of LLMs for Embodied Decision Making EMNLP 2025

Leaky Thoughts: Large Reasoning Models Are Not Private Thinkers EMNLP 2025

MrGuard: A Multilingual Reasoning Guardrail for Universal LLM Safety EMNLP 2025

Reasoning-to-Defend: Safety-Aware Reasoning Can Defend Large Language Models from Jailbreaking EMNLP 2025

Are Vision-Language Models Safe in the Wild? A Meme-Based Benchmark Study EMNLP 2025

Breaking Agents: Compromising Autonomous LLM Agents Through Malfunction Amplification EMNLP 2025

Certified Mitigation of Worst-Case LLM Copyright Infringement EMNLP 2025

Taxonomy of Comprehensive Safety for Clinical Agents EMNLP 2025

HalluDetect: Detecting, Mitigating, and Benchmarking Hallucinations in Conversational Systems in the Legal Domain EMNLP 2025

How to Fine-Tune Safely on a Budget: Model Adaptation Using Minimal Resources EMNLP 2025

VestaBench: An Embodied Benchmark for Safe Long-Horizon Planning Under Multi-Constraint and Adversarial Settings EMNLP 2025

sudoLLM: On Multi-role Alignment of Language Models EMNLP 2025

English as Defense Proxy: Mitigating Multilingual Jailbreak via Eliciting English Safety Knowledge EMNLP 2025

Confusion is the Final Barrier: Rethinking Jailbreak Evaluation and Investigating the Real Misuse Threat of LLMs EMNLP 2025

The Hallucination Tax of Reinforcement Finetuning EMNLP 2025

Exploring the Vulnerability of the Content Moderation Guardrail in Large Language Models via Intent Manipulation EMNLP 2025

Test Where Decisions Matter: Importance-driven Testing for Deep Reinforcement Learning NIPS 2024

MedSafetyBench: Evaluating and Improving the Medical Safety of Large Language Models NIPS 2024

Defensive Unlearning with Adversarial Training for Robust Concept Erasure in Diffusion Models NIPS 2024

Dataset and Lessons Learned from the 2024 SaTML LLM Capture-the-Flag Competition NIPS 2024

Mission Impossible: A Statistical Perspective on Jailbreaking LLMs NIPS 2024

Simplifying Constraint Inference with Inverse Reinforcement Learning NIPS 2024

The Art of Saying No: Contextual Noncompliance in Language Models NIPS 2024

Unelicitable Backdoors via Cryptographic Transformer Circuits NIPS 2024

JailbreakBench: An Open Robustness Benchmark for Jailbreaking Large Language Models NIPS 2024