conftrace
Safety
414 papers
Papers per year
2016: 1
2017: 1
2018: 4
2019: 8
2020: 11
2021: 21
2022: 29
2023: 36
2024: 87
2025: 117
2026: 99
Papers
Subtle Risks, Critical Failures: A Framework for Diagnosing Physical Safety of LLMs for Embodied Decision Making
EMNLP 2025
Leaky Thoughts: Large Reasoning Models Are Not Private Thinkers
EMNLP 2025
MrGuard: A Multilingual Reasoning Guardrail for Universal LLM Safety
EMNLP 2025
Reasoning-to-Defend: Safety-Aware Reasoning Can Defend Large Language Models from Jailbreaking
EMNLP 2025
Are Vision-Language Models Safe in the Wild? A Meme-Based Benchmark Study
EMNLP 2025
Breaking Agents: Compromising Autonomous LLM Agents Through Malfunction Amplification
EMNLP 2025
Certified Mitigation of Worst-Case LLM Copyright Infringement
EMNLP 2025
Taxonomy of Comprehensive Safety for Clinical Agents
EMNLP 2025
HalluDetect: Detecting, Mitigating, and Benchmarking Hallucinations in Conversational Systems in the Legal Domain
EMNLP 2025
How to Fine-Tune Safely on a Budget: Model Adaptation Using Minimal Resources
EMNLP 2025
VestaBench: An Embodied Benchmark for Safe Long-Horizon Planning Under Multi-Constraint and Adversarial Settings
EMNLP 2025
sudoLLM: On Multi-role Alignment of Language Models
EMNLP 2025
English as Defense Proxy: Mitigating Multilingual Jailbreak via Eliciting English Safety Knowledge
EMNLP 2025
Confusion is the Final Barrier: Rethinking Jailbreak Evaluation and Investigating the Real Misuse Threat of LLMs
EMNLP 2025
The Hallucination Tax of Reinforcement Finetuning
EMNLP 2025
Exploring the Vulnerability of the Content Moderation Guardrail in Large Language Models via Intent Manipulation
EMNLP 2025
Test Where Decisions Matter: Importance-driven Testing for Deep Reinforcement Learning
NIPS 2024
MedSafetyBench: Evaluating and Improving the Medical Safety of Large Language Models
NIPS 2024
Defensive Unlearning with Adversarial Training for Robust Concept Erasure in Diffusion Models
NIPS 2024
Dataset and Lessons Learned from the 2024 SaTML LLM Capture-the-Flag Competition
NIPS 2024
Mission Impossible: A Statistical Perspective on Jailbreaking LLMs
NIPS 2024
Simplifying Constraint Inference with Inverse Reinforcement Learning
NIPS 2024
The Art of Saying No: Contextual Noncompliance in Language Models
NIPS 2024
Unelicitable Backdoors via Cryptographic Transformer Circuits
NIPS 2024
JailbreakBench: An Open Robustness Benchmark for Jailbreaking Large Language Models
NIPS 2024