AI Safety

2972 directly classified papers

Papers

MULBERE: Multilingual Jailbreak Robustness Using Targeted Latent Adversarial Training EMNLP 2025

Large Language Models as Detectors or Instigators of Hate Speech in Low-resource Ethiopian Languages EMNLP 2025

Multilingual Text-to-Image Generation Magnifies Gender Stereotypes ACL 2025

Detoxify-IT: An Italian Parallel Dataset for Text Detoxification ACL 2025

Confidence-Based Response Abstinence: Improving LLM Trustworthiness via Activation-Based Uncertainty Estimation EMNLP 2025

Extended Abstract: Probing-Guided Parameter-Efficient Fine-Tuning for Balancing Linguistic Adaptation and Safety in LLM-based Social Influence Systems ACL 2025

Are the Values of LLMs Structurally Aligned with Humans? A Causal Perspective ACL 2025

On the Role of Unobserved Sequences on Sample-based Uncertainty Quantification for LLMs EMNLP 2025

Towards Trustworthy Summarization of Cardiovascular Articles: A Factuality-and-Uncertainty-Aware Biomedical LLM Approach EMNLP 2025

Adversarial Preference Learning for Robust LLM Alignment ACL 2025

Unintended Harms of Value-Aligned LLMs: Psychological and Empirical Insights ACL 2025

Model Rake: A Defense Against Stealing Attacks in Split Learning IJCAI 2025

LongSafety: Evaluating Long-Context Safety of Large Language Models ACL 2025

LSSF: Safety Alignment for Large Language Models through Low-Rank Safety Subspace Fusion ACL 2025

Neuron Similarity-Based Neural Network Verification via Abstraction and Refinement IJCAI 2025

Human Bias in the Face of AI: Examining Human Judgment Against Text Labeled as AI Generated ACL 2025

Human-AI Moral Judgment Congruence on Real-World Scenarios: A Cross-Lingual Analysis EMNLP 2025

No for Some, Yes for Others: Persona Prompts and Other Sources of False Refusal in Language Models EMNLP 2025

PIGuard: Prompt Injection Guardrail via Mitigating Overdefense for Free ACL 2025

MEraser: An Effective Fingerprint Erasure Approach for Large Language Models ACL 2025

Formal Synthesis of Safe Kolmogorov-Arnold Network Controllers with Barrier Certificates IJCAI 2025

A Theory of Response Sampling in LLMs: Part Descriptive and Part Prescriptive ACL 2025

SDD: Self-Degraded Defense against Malicious Fine-tuning ACL 2025

Beyond Clean Training Data: A Versatile and Model-Agnostic Framework for Out-of-Distribution Detection with Contaminated Training Data CVPR 2025

Jailbreaking the Non-Transferable Barrier via Test-Time Data Disguising CVPR 2025