AI Safety

2,972 papers

Papers

Bag of Tricks: Benchmarking of Jailbreak Attacks on LLMs NIPS 2024

Fast Best-of-N Decoding via Speculative Rejection NIPS 2024

Pseudo-Private Data Guided Model Inversion Attacks NIPS 2024

MetaAligner: Towards Generalizable Multi-Objective Alignment of Language Models NIPS 2024

Defensive Unlearning with Adversarial Training for Robust Concept Erasure in Diffusion Models NIPS 2024

Dataset and Lessons Learned from the 2024 SaTML LLM Capture-the-Flag Competition NIPS 2024

Mission Impossible: A Statistical Perspective on Jailbreaking LLMs NIPS 2024

Robust Prompt Optimization for Defending Language Models Against Jailbreaking Attacks NIPS 2024

Unveiling the Bias Impact on Symmetric Moral Consistency of Large Language Models NIPS 2024

Unveiling and Mitigating Backdoor Vulnerabilities based on Unlearning Weight Changes and Backdoor Activeness NIPS 2024

WaveAttack: Asymmetric Frequency Obfuscation-based Backdoor Attacks Against Deep Neural Networks NIPS 2024

Simplifying Constraint Inference with Inverse Reinforcement Learning NIPS 2024

WildTeaming at Scale: From In-the-Wild Jailbreaks to (Adversarially) Safer Language Models NIPS 2024

Expert-level protocol translation for self-driving labs NIPS 2024

Intruding with Words: Towards Understanding Graph Injection Attacks at the Text Level NIPS 2024

Personalized Steering of Large Language Models: Versatile Steering Vectors Through Bi-directional Preference Optimization NIPS 2024

Achieving Domain-Independent Certified Robustness via Knowledge Continuity NIPS 2024

Everyday Object Meets Vision-and-Language Navigation Agent via Backdoor NIPS 2024

The Art of Saying No: Contextual Noncompliance in Language Models NIPS 2024

Diffusion Models are Certifiably Robust Classifiers NIPS 2024

Selective Generation for Controllable Language Models NIPS 2024

CTIBench: A Benchmark for Evaluating LLMs in Cyber Threat Intelligence NIPS 2024

Predicting Future Actions of Reinforcement Learning Agents NIPS 2024

Easy-to-Hard Generalization: Scalable Alignment Beyond Human Supervision NIPS 2024

Calibrated Self-Rewarding Vision Language Models NIPS 2024