AI Safety

2972 directly classified papers

Papers

Steering Llama 2 via Contrastive Activation Addition ACL 2024

Breaking the False Sense of Security in Backdoor Defense through Re-Activation Attack NeurIPS 2024

InjecAgent: Benchmarking Indirect Prompt Injections in Tool-Integrated Large Language Model Agents ACL 2024

Here’s a Free Lunch: Sanitizing Backdoored Models with Model Merge ACL 2024

BadActs: A Universal Backdoor Defense in the Activation Space ACL 2024

RAID: A Shared Benchmark for Robust Evaluation of Machine-Generated Text Detectors ACL 2024

Extreme Miscalibration and the Illusion of Adversarial Robustness ACL 2024

How Susceptible are Large Language Models to Ideological Manipulation? EMNLP 2024

Don’t Just Say “I don’t know”! Self-aligning Large Language Models for Responding to Unknown Questions with Explanations EMNLP 2024

Expert-level protocol translation for self-driving labs NeurIPS 2024

Effective Ambiguity Attack Against Passport-Based DNN Intellectual Property Protection Schemes Through Fully Connected Layer Substitution CVPR 2023

Inference-Time Intervention: Eliciting Truthful Answers from a Language Model NeurIPS 2023

RealBehavior: A Framework for Faithfully Characterizing Foundation Models’ Human-like Behavior Mechanisms EMNLP 2023

The Resource Problem of Using Linear Layer Leakage Attack in Federated Learning CVPR 2023

Utterance Classification with Logical Neural Network: Explainable AI for Mental Disorder Diagnosis ACL 2023

NOTABLE: Transferable Backdoor Attacks Against Prompt-based NLP Models ACL 2023

Query-Efficient Black-Box Red Teaming via Bayesian Optimization ACL 2023

Improving the robustness of NLI models with minimax training ACL 2023

SafeConv: Explaining and Correcting Conversational Unsafe Behavior ACL 2023

Holistic Adversarial Robustness of Deep Learning Models AAAI 2023

Towards Safe AI: Sandboxing DNNs-Based Controllers in Stochastic Games AAAI 2023

Reachability Analysis of Neural Network Control Systems AAAI 2023

DeepGemini: Verifying Dependency Fairness for Deep Neural Network AAAI 2023

Beyond NaN: Resiliency of Optimization Layers in the Face of Infeasibility AAAI 2023

Towards Verifying the Geometric Robustness of Large-Scale Neural Networks AAAI 2023