conftrace
AI Safety
2,972 papers
Papers per year
2002: 1
2006: 1
2007: 1
2012: 4
2013: 1
2015: 5
2016: 1
2017: 13
2018: 40
2019: 91
2020: 111
2021: 181
2022: 204
2023: 333
2024: 642
2025: 1031
2026: 312
Papers
Bag of Tricks: Benchmarking of Jailbreak Attacks on LLMs
NIPS 2024
Fast Best-of-N Decoding via Speculative Rejection
NIPS 2024
Pseudo-Private Data Guided Model Inversion Attacks
NIPS 2024
MetaAligner: Towards Generalizable Multi-Objective Alignment of Language Models
NIPS 2024
Defensive Unlearning with Adversarial Training for Robust Concept Erasure in Diffusion Models
NIPS 2024
Dataset and Lessons Learned from the 2024 SaTML LLM Capture-the-Flag Competition
NIPS 2024
Mission Impossible: A Statistical Perspective on Jailbreaking LLMs
NIPS 2024
Robust Prompt Optimization for Defending Language Models Against Jailbreaking Attacks
NIPS 2024
Unveiling the Bias Impact on Symmetric Moral Consistency of Large Language Models
NIPS 2024
Unveiling and Mitigating Backdoor Vulnerabilities based on Unlearning Weight Changes and Backdoor Activeness
NIPS 2024
WaveAttack: Asymmetric Frequency Obfuscation-based Backdoor Attacks Against Deep Neural Networks
NIPS 2024
Simplifying Constraint Inference with Inverse Reinforcement Learning
NIPS 2024
WildTeaming at Scale: From In-the-Wild Jailbreaks to (Adversarially) Safer Language Models
NIPS 2024
Expert-level protocol translation for self-driving labs
NIPS 2024
Intruding with Words: Towards Understanding Graph Injection Attacks at the Text Level
NIPS 2024
Personalized Steering of Large Language Models: Versatile Steering Vectors Through Bi-directional Preference Optimization
NIPS 2024
Achieving Domain-Independent Certified Robustness via Knowledge Continuity
NIPS 2024
Everyday Object Meets Vision-and-Language Navigation Agent via Backdoor
NIPS 2024
The Art of Saying No: Contextual Noncompliance in Language Models
NIPS 2024
Diffusion Models are Certifiably Robust Classifiers
NIPS 2024
Selective Generation for Controllable Language Models
NIPS 2024
CTIBench: A Benchmark for Evaluating LLMs in Cyber Threat Intelligence
NIPS 2024
Predicting Future Actions of Reinforcement Learning Agents
NIPS 2024
Easy-to-Hard Generalization: Scalable Alignment Beyond Human Supervision
NIPS 2024
Calibrated Self-Rewarding Vision Language Models
NIPS 2024