AI Safety

2972 directly classified papers

Papers

Prompt-Guided Internal States for Hallucination Detection of Large Language Models ACL 2025

Crossfire: An Elastic Defense Framework for Graph Neural Networks Under Bit Flip Attacks AAAI 2025

Towards Fully Exploiting LLM Internal States to Enhance Knowledge Boundary Perception ACL 2025

Backdoor Attack on Propagation-based Rumor Detectors AAAI 2025

LLMs know their vulnerabilities: Uncover Safety Gaps through Natural Distribution Shifts ACL 2025

Towards Computational Foreseeability AAAI 2025

Shaping the Safety Boundaries: Understanding and Defending Against Jailbreaks in Large Language Models ACL 2025

Grimm: A Plug-and-Play Perturbation Rectifier for Graph Neural Networks Defending Against Poisoning Attacks AAAI 2025

Safety Alignment via Constrained Knowledge Unlearning ACL 2025

Probabilistic Shielding for Safe Reinforcement Learning AAAI 2025

Dehumanizing Machines: Mitigating Anthropomorphic Behaviors in Text Generation Systems ACL 2025

Tilted Quantile Gradient Updates for Quantile-Constrained Reinforcement Learning AAAI 2025

Exploring the Impact of Instruction-Tuning on LLM’s Susceptibility to Misinformation ACL 2025

Stealthy Jailbreak Attacks on Large Language Models via Benign Data Mirroring NAACL 2025

Internal Value Alignment in Large Language Models through Controlled Value Vector Activation ACL 2025

Protecting Model Adaptation from Trojans in the Unlabeled Data AAAI 2025

LLMs Caught in the Crossfire: Malware Requests and Jailbreak Challenges ACL 2025

Extracting and Understanding the Superficial Knowledge in Alignment NAACL 2025

Know Your Mistakes: Towards Preventing Overreliance on Task-Oriented Conversational AI Through Accountability Modeling ACL 2025

COMMIT: Certifying Robustness of Multi-Sensor Fusion Systems Against Semantic Attacks AAAI 2025

SDD: Self-Degraded Defense against Malicious Fine-tuning ACL 2025

SEAL: Systematic Error Analysis for Value ALignment AAAI 2025

A Theory of Response Sampling in LLMs: Part Descriptive and Part Prescriptive ACL 2025

Quantitative Predictive Monitoring and Control for Safe Human-Machine Interaction AAAI 2025

LongSafety: Enhance Safety for Long-Context LLMs ACL 2025