conftrace
AI Safety
2,972 papers
Papers per year
2002: 1
2006: 1
2007: 1
2012: 4
2013: 1
2015: 5
2016: 1
2017: 13
2018: 40
2019: 91
2020: 111
2021: 181
2022: 204
2023: 333
2024: 642
2025: 1031
2026: 312
Papers
Dataset and Lessons Learned from the 2024 SaTML LLM Capture-the-Flag Competition
NIPS 2024
Play Guessing Game with LLM: Indirect Jailbreak Attack with Implicit Clues
ACL 2024
Simplifying Constraint Inference with Inverse Reinforcement Learning
NIPS 2024
The Generalization and Robustness of Transformer-Based Language Models on Commonsense Reasoning
AAAI 2024
Efficient Adversarial Training in LLMs with Continuous Attacks
NIPS 2024
Backdooring Instruction-Tuned Large Language Models with Virtual Prompt Injection
NAACL 2024
Learning Safety Constraints from Demonstrations with Unknown Rewards
AISTATS 2024
AgentDojo: A Dynamic Environment to Evaluate Prompt Injection Attacks and Defenses for LLM Agents
NIPS 2024
A theoretical case-study of Scalable Oversight in Hierarchical Reinforcement Learning
NIPS 2024
Compos Mentis at SemEval2024 Task6: A Multi-Faceted Role-based Large Language Model Ensemble to Detect Hallucination
SEMEVAL 2024
Towards a Unified Framework for Adaptable Problematic Content Detection via Continual Learning
NAACL 2024
BadActs: A Universal Backdoor Defense in the Activation Space
ACL 2024
Improving Alignment and Robustness with Circuit Breakers
NIPS 2024
Survival of the Safest: Towards Secure Prompt Optimization through Interleaved Multi-Objective Evolution
EMNLP 2024
Interpretable Preferences via Multi-Objective Reward Modeling and Mixture-of-Experts
EMNLP 2024
DarkFed: A Data-Free Backdoor Attack in Federated Learning
IJCAI 2024
SELF-GUARD: Empower the LLM to Safeguard Itself
NAACL 2024
Conditional Backdoor Attack via JPEG Compression
AAAI 2024
Fast Best-of-N Decoding via Speculative Rejection
NIPS 2024
Sandwich attack: Multi-language Mixture Adaptive Attack on LLMs
NAACL 2024
LOTUS: Evasive and Resilient Backdoor Attacks through Sub-Partitioning
CVPR 2024
Membership Inference Attacks against Fine-tuned Large Language Models via Self-prompt Calibration
NIPS 2024
Raccoon: Prompt Extraction Benchmark of LLM-Integrated Applications
ACL 2024
CoSafe: Evaluating Large Language Model Safety in Multi-Turn Dialogue Coreference
EMNLP 2024
Secure Your Model: An Effective Key Prompt Protection Mechanism for Large Language Models
NAACL 2024