AI Safety

2972 directly classified papers

Papers

I Don't Know: Explicit Modeling of Uncertainty with an [IDK] Token NIPS 2024

Representation Noising: A Defence Mechanism Against Harmful Finetuning NIPS 2024

Bag of Tricks: Benchmarking of Jailbreak Attacks on LLMs NIPS 2024

Fast Best-of-N Decoding via Speculative Rejection NIPS 2024

Mission Impossible: A Statistical Perspective on Jailbreaking LLMs NIPS 2024

Robust Prompt Optimization for Defending Language Models Against Jailbreaking Attacks NIPS 2024

Unveiling the Bias Impact on Symmetric Moral Consistency of Large Language Models NIPS 2024

Personalized Steering of Large Language Models: Versatile Steering Vectors Through Bi-directional Preference Optimization NIPS 2024

Selective Generation for Controllable Language Models NIPS 2024

CTIBench: A Benchmark for Evaluating LLMs in Cyber Threat Intelligence NIPS 2024

Calibrated Self-Rewarding Vision Language Models NIPS 2024

Stealth edits to large language models NIPS 2024

ERBench: An Entity-Relationship based Automatically Verifiable Hallucination Benchmark for Large Language Models NIPS 2024

JailbreakBench: An Open Robustness Benchmark for Jailbreaking Large Language Models NIPS 2024

Bileve: Securing Text Provenance in Large Language Models Against Spoofing with Bi-level Signature NIPS 2024

NYU CTF Bench: A Scalable Open-Source Benchmark Dataset for Evaluating LLMs in Offensive Security NIPS 2024

Impeding LLM-assisted Cheating in Introductory Programming Assignments via Adversarial Perturbation EMNLP 2024

On the Influence of Gender and Race in Romantic Relationship Prediction from Large Language Models EMNLP 2024

Evaluating the Instruction-Following Robustness of Large Language Models to Prompt Injection EMNLP 2024

Successfully Guiding Humans with Imperfect Instructions by Highlighting Potential Errors and Suggesting Corrections EMNLP 2024

AgentReview: Exploring Peer Review Dynamics with LLM Agents EMNLP 2024

Towards Tool Use Alignment of Large Language Models EMNLP 2024

Evaluating Psychological Safety of Large Language Models EMNLP 2024

Alignment-Enhanced Decoding: Defending Jailbreaks via Token-Level Adaptive Refining of Probability Distributions EMNLP 2024

Learn to Refuse: Making Large Language Models More Controllable and Reliable through Knowledge Scope Limitation and Refusal Mechanism EMNLP 2024