AI Safety

2972 directly classified papers

Papers

Formal Synthesis of Safe Kolmogorov-Arnold Network Controllers with Barrier Certificates IJCAI 2025

PIGuard: Prompt Injection Guardrail via Mitigating Overdefense for Free ACL 2025

ProcessBench: Identifying Process Errors in Mathematical Reasoning ACL 2025

LSSF: Safety Alignment for Large Language Models through Low-Rank Safety Subspace Fusion ACL 2025

Neuron Similarity-Based Neural Network Verification via Abstraction and Refinement IJCAI 2025

LongSafety: Evaluating Long-Context Safety of Large Language Models ACL 2025

LED-Merging: Mitigating Safety-Utility Conflicts in Model Merging with Location-Election-Disjoint ACL 2025

Unintended Harms of Value-Aligned LLMs: Psychological and Empirical Insights ACL 2025

Model Rake: A Defense Against Stealing Attacks in Split Learning IJCAI 2025

Adversarial Preference Learning for Robust LLM Alignment ACL 2025

Mind the Gesture: Evaluating AI Sensitivity to Culturally Offensive Non-Verbal Gestures ACL 2025

Are the Values of LLMs Structurally Aligned with Humans? A Causal Perspective ACL 2025

COVER: Context-Driven Over-Refusal Verification in LLMs ACL 2025

Extended Abstract: Probing-Guided Parameter-Efficient Fine-Tuning for Balancing Linguistic Adaptation and Safety in LLM-based Social Influence Systems ACL 2025

Proxy Barrier: A Hidden Repeater Layer Defense Against System Prompt Leakage and Jailbreaking EMNLP 2025

Detoxify-IT: An Italian Parallel Dataset for Text Detoxification ACL 2025

Unraveling Misinformation Propagation in LLM Reasoning EMNLP 2025

Multilingual Text-to-Image Generation Magnifies Gender Stereotypes ACL 2025

Ferret: Faster and Effective Automated Red Teaming with Reward-Based Scoring Technique EMNLP 2025

Caution for the Environment: Multimodal LLM Agents are Susceptible to Environmental Distractions ACL 2025

Backdoor Attack on Propagation-based Rumor Detectors AAAI 2025

Can GPTZero’s AI Vocabulary Distinguish Between LLM-Generated and Student-Written Essays? ACL 2025

Can You Trick the Grader? Adversarial Persuasion of LLM Judges EMNLP 2025

Efficient but Vulnerable: Benchmarking and Defending LLM Batch Prompting Attack ACL 2025

LongSafety: Enhance Safety for Long-Context LLMs ACL 2025