AI Safety

2972 directly classified papers

Papers

Covert Bias: The Severity of Social Views’ Unalignment in Language Models Towards Implicit and Explicit Opinion EMNLP 2024

Active Learning for Robust and Representative LLM Generation in Safety-Critical Scenarios EMNLP 2024

Pruning for Protection: Increasing Jailbreak Resistance in Aligned LLMs Without Fine-Tuning EMNLP 2024

WellDunn: On the Robustness and Explainability of Language Models and Large Language Models in Identifying Wellness Dimensions EMNLP 2024

LLM Internal States Reveal Hallucination Risk Faced With a Query EMNLP 2024

Think Twice Before Trusting: Self-Detection for Large Language Models through Comprehensive Answer Reflection EMNLP 2024

Virtual Context Enhancing Jailbreak Attacks with Special Token Injection EMNLP 2024

NegotiationToM: A Benchmark for Stress-testing Machine Theory of Mind on Negotiation Surrounding EMNLP 2024

Self-Evolution Fine-Tuning for Policy Optimization EMNLP 2024

LLMGuard: Guarding against Unsafe LLM Behavior AAAI 2024

Evaluating AI Red Teaming’s Readiness to Address Environmental Harms: A Thematic Analysis of LLM Discourse AAAI 2024

Autonomous Policy Explanations for Effective Human-Machine Teaming AAAI 2024

The Generalization and Robustness of Transformer-Based Language Models on Commonsense Reasoning AAAI 2024

Dr. R.O. Bott Will See You Now: Exploring AI for Wellbeing with Middle School Students AAAI 2024

Detecting AI-Generated Code Assignments Using Perplexity of Large Language Models AAAI 2024

Visual Adversarial Examples Jailbreak Aligned Large Language Models AAAI 2024

SocialStigmaQA: A Benchmark to Uncover Stigma Amplification in Generative Language Models AAAI 2024

MathAttack: Attacking Large Language Models towards Math Solving Ability AAAI 2024

Preference Ranking Optimization for Human Alignment AAAI 2024

Watch Out for Your Agents! Investigating Backdoor Threats to LLM-Based Agents NIPS 2024

Normative Testimony and Belief Functions: A Formal Theory of Norm Learning IJCAI 2024

ART: Automatic Red-teaming for Text-to-Image Models to Protect Benign Users NIPS 2024

Hyper-opinion Evidential Deep Learning for Out-of-Distribution Detection NIPS 2024

LLM Evaluators Recognize and Favor Their Own Generations NIPS 2024

DALD: Improving Logits-based Detector without Logits from Black-box LLMs NIPS 2024