AI Safety

2972 directly classified papers

Papers

Comparing Moral Values in Western English-speaking societies and LLMs with Word Associations ACL 2025

Tightening Robustness Verification of MaxPool-based Neural Networks via Minimizing the Over-Approximation Zone CVPR 2025

Vulnerability of Large Language Models to Output Prefix Jailbreaks: Impact of Positions on Safety NAACL 2025

Large Language Models are Easily Confused: A Quantitative Metric, Security Implications and Typological Analysis NAACL 2025

Refuse Whenever You Feel Unsafe: Improving Safety in LLMs via Decoupled Refusal Training ACL 2025

Safe Inputs but Unsafe Output: Benchmarking Cross-modality Safety Alignment of Large Vision-Language Models NAACL 2025

Multilingual Blending: Large Language Model Safety Alignment Evaluation with Language Mixture NAACL 2025

Cultural Learning-Based Culture Adaptation of Language Models ACL 2025

Investigating Motivated Inference in Large Language Models EMNLP 2025

Test-Time Backdoor Detection for Object Detection Models CVPR 2025

Challenges in Trustworthy Human Evaluation of Chatbots NAACL 2025

Atoxia: Red-teaming Large Language Models with Target Toxic Answers NAACL 2025

Mixture of insighTful Experts (MoTE): The Synergy of Reasoning Chains and Expert Mixtures in Self-Alignment ACL 2025

Towards Trustworthy Summarization of Cardiovascular Articles: A Factuality-and-Uncertainty-Aware Biomedical LLM Approach EMNLP 2025

Con Instruction: Universal Jailbreaking of Multimodal Large Language Models via Non-Textual Modalities ACL 2025

Human-AI Moral Judgment Congruence on Real-World Scenarios: A Cross-Lingual Analysis EMNLP 2025

SMLE: Safe Machine Learning via Embedded Overapproximation AAAI 2025

RESF: Regularized-Entropy-Sensitive Fingerprinting for Black-Box Tamper Detection of Large Language Models EMNLP 2025

Ensemble Watermarks for Large Language Models ACL 2025

Are Stereotypes Leading LLMs’ Zero-Shot Stance Detection? EMNLP 2025

Confidence-Based Response Abstinence: Improving LLM Trustworthiness via Activation-Based Uncertainty Estimation EMNLP 2025

uMedSum: A Unified Framework for Clinical Abstractive Summarization ACL 2025

MULBERE: Multilingual Jailbreak Robustness Using Targeted Latent Adversarial Training EMNLP 2025

Large Language Models as Detectors or Instigators of Hate Speech in Low-resource Ethiopian Languages EMNLP 2025

RLAIF-V: Open-Source AI Feedback Leads to Super GPT-4V Trustworthiness CVPR 2025