AI Safety

2972 directly classified papers

Papers

Look Before You Leap: Enhance Attention and Vigilance Regarding Harmful Content with GuidelineLLM AAAI 2025

DiveR-CT: Diversity-enhanced Red Teaming Large Language Model Assistants with Relaxing Constraints AAAI 2025

Watch Out for Your Guidance on Generation! Exploring Conditional Backdoor Attacks against Large Language Models AAAI 2025

RepeatLeakage: Leak Prompts from Repeating as Large Language Model Is a Good Repeater AAAI 2025

Measuring Human and AI Values Based on Generative Psychometrics with Large Language Models AAAI 2025

CL-Attack: Textual Backdoor Attacks via Cross-Lingual Triggers AAAI 2025

LLM Agents Can Be Choice-Supportive Biased Evaluators: An Empirical Study AAAI 2025

SafeInfer: Context Adaptive Decoding Time Safety Alignment for Large Language Models AAAI 2025

Token Highlighter: Inspecting and Mitigating Jailbreak Prompts for Large Language Models AAAI 2025

ChatBug: A Common Vulnerability of Aligned LLMs Induced by Chat Templates AAAI 2025

JailPO: A Novel Black-Box Jailbreak Framework via Preference Optimization Against Aligned LLMs AAAI 2025

Internal Activation Revision: Safeguarding Vision Language Models Without Parameter Update AAAI 2025

Strong Empowered and Aligned Weak Mastered Annotation for Weak-to-Strong Generalization AAAI 2025

Retention Score: Quantifying Jailbreak Risks for Vision Language Models AAAI 2025

Exploring Intrinsic Alignments Within Text Corpus AAAI 2025

Data with High and Consistent Preference Difference Are Better for Reward Model AAAI 2025

Neurons to Words: A Novel Method for Automated Neural Network Interpretability and Alignment AAAI 2025

SafetyPrompts: A Systematic Review of Open Datasets for Evaluating and Improving Large Language Model Safety AAAI 2025

Align-Pro: A Principled Approach to Prompt Optimization for LLM Alignment AAAI 2025

Towards a Theory of AI Personhood AAAI 2025

Aligning Large Language Models for Faithful Integrity Against Opposing Argument AAAI 2025

CALM: Curiosity-Driven Auditing for Large Language Models AAAI 2025

Sequence to Sequence Reward Modeling: Improving RLHF by Language Feedback AAAI 2025

RTP-LX: Can LLMs Evaluate Toxicity in Multilingual Scenarios? AAAI 2025

Revisiting Early Detection of Sexual Predators via Turn-level Optimization NAACL 2025