conftrace_

AI Safety

3,026 papers

Papers per year (bar chart; x-axis ticks '15, '20, '25): 1, 1, 1, 4, 1, 5, 1, 13, 40, 91, 111, 181, 204, 333, 642, 1031, 366

Papers

Here’s a Free Lunch: Sanitizing Backdoored Models with Model Merge ACL 2024

From Representational Harms to Quality-of-Service Harms: A Case Study on Llama 2 Safety Safeguards ACL 2024

Defending LLMs against Jailbreaking Attacks via Backtranslation ACL 2024

Sowing the Wind, Reaping the Whirlwind: The Impact of Editing Language Models ACL 2024

Do Clinicians Know How to Prompt? The Need for Automatic Prompt Optimization Help in Clinical Note Generation ACL 2024

Safe-Embed: Unveiling the Safety-Critical Knowledge of Sentence Encoders ACL 2024

Reinforcement Learning-Driven LLM Agent for Automated Attacks on LLMs ACL 2024

An Analysis of Tasks and Datasets in Peer Reviewing ACL 2024

Know Thine Enemy: Adaptive Attacks on Misinformation Detection Using Reinforcement Learning ACL 2024

Pixel-wise Smoothing for Certified Robustness against Camera Motion Perturbations AISTATS 2024

Adaptive Experiment Design with Synthetic Controls AISTATS 2024

Analyzing Explainer Robustness via Probabilistic Lipschitzness of Prediction Functions AISTATS 2024

Taming False Positives in Out-of-Distribution Detection with Human Feedback AISTATS 2024

Ethics in Action: Training Reinforcement Learning Agents for Moral Decision-making In Text-based Adventure Games AISTATS 2024

Learning Safety Constraints from Demonstrations with Unknown Rewards AISTATS 2024

Optimal Zero-Shot Detector for Multi-Armed Attacks AISTATS 2024

Formal Verification of Unknown Stochastic Systems via Non-parametric Estimation AISTATS 2024

How does GPT-2 Predict Acronyms? Extracting and Understanding a Circuit via Mechanistic Interpretability AISTATS 2024

Sampling-based Safe Reinforcement Learning for Nonlinear Dynamical Systems AISTATS 2024

Non-vacuous Generalization Bounds for Adversarial Risk in Stochastic Neural Networks AISTATS 2024

Tight Verification of Probabilistic Robustness in Bayesian Neural Networks AISTATS 2024

ADEA: An Argumentative Dialogue Dataset on Ethical Issues Concerning Future A.I. Applications COLING 2024

Backdoor NLP Models via AI-Generated Text COLING 2024

Detection, Diagnosis, and Explanation: A Benchmark for Chinese Medical Hallucination Evaluation COLING 2024

How Susceptible Are LLMs to Logical Fallacies? COLING 2024