AI Safety

2972 directly classified papers

Papers

GumbelSoft: Diversified Language Model Watermarking via the GumbelMax-trick ACL 2024

Relying on the Unreliable: The Impact of Language Models’ Reluctance to Express Uncertainty ACL 2024

More than Minorities and Majorities: Understanding Multilateral Bias in Language Generation ACL 2024

Defending Against Weight-Poisoning Backdoor Attacks for Parameter-Efficient Fine-Tuning NAACL 2024

Cognitive Overload: Jailbreaking Large Language Models with Overloaded Logical Thinking NAACL 2024

Achieving Domain-Independent Certified Robustness via Knowledge Continuity NIPS 2024

Truthful High Dimensional Sparse Linear Regression NIPS 2024

Improving Alignment and Robustness with Circuit Breakers NIPS 2024

Provably Safe Neural Network Controllers via Differential Dynamic Logic NIPS 2024

Measuring Goal-Directedness NIPS 2024

When LLM Meets DRL: Advancing Jailbreaking Efficiency via DRL-guided Search NIPS 2024

A theoretical case-study of Scalable Oversight in Hierarchical Reinforcement Learning NIPS 2024

Test Where Decisions Matter: Importance-driven Testing for Deep Reinforcement Learning NIPS 2024

Learning Human-like Representations to Enable Learning Human Values NIPS 2024

Simplifying Constraint Inference with Inverse Reinforcement Learning NIPS 2024

Everyday Object Meets Vision-and-Language Navigation Agent via Backdoor NIPS 2024

Predicting Future Actions of Reinforcement Learning Agents NIPS 2024

RA-PbRL: Provably Efficient Risk-Aware Preference-Based Reinforcement Learning NIPS 2024

Monitoring of Perception Systems: Deterministic, Probabilistic, and Learning-Based Fault Detection and Identification (Abstract Reprint) AAAI 2024

Secure Your Model: An Effective Key Prompt Protection Mechanism for Large Language Models NAACL 2024

Safe Linear Bandits over Unknown Polytopes COLT 2024

On the Computability of Robust PAC Learning COLT 2024

Language Models are Homer Simpson! Safety Re-Alignment of Fine-tuned Language Models through Task Arithmetic ACL 2024

Can LLMs Correct Physicians, Yet? Investigating Effective Interaction Methods in the Medical Domain NAACL 2024

The power of an adversary in Glauber dynamics COLT 2024