AI Safety

2972 directly classified papers

Papers

Gradient Short-Circuit: Efficient Out-of-Distribution Detection via Feature Intervention ICCV 2025

PLA: Prompt Learning Attack against Text-to-Image Generative Models ICCV 2025

DCT-Shield: A Robust Frequency Domain Defense against Malicious Image Editing ICCV 2025

Prototype Guided Backdoor Defense via Activation Space Manipulation ICCV 2025

Meta-Unlearning on Diffusion Models: Preventing Relearning Unlearned Concepts ICCV 2025

Human Bias in the Face of AI: Examining Human Judgment Against Text Labeled as AI Generated ACL 2025

Blinded by Context: Unveiling the Halo Effect of MLLM in AI Hiring ACL 2025

Enhancing Security and Strengthening Defenses in Automated Short-Answer Grading Systems ACL 2025

PL-Guard: Benchmarking Language Model Safety for Polish ACL 2025

The Law of Knowledge Overshadowing: Towards Understanding, Predicting, and Preventing LLM Hallucination ACL 2025

Are Bias Evaluation Methods Biased? ACL 2025

Cleanse: Uncertainty Estimation Approach Using Clustering-based Semantic Consistency in LLMs ACL 2025

Judging the Judges: Evaluating Alignment and Vulnerabilities in LLMs-as-Judges ACL 2025

ELAB: Extensive LLM Alignment Benchmark in Persian Language ACL 2025

Teaching Large Language Models to Express Knowledge Boundary from Their Own Signals ACL 2025

Can LLMs Recognize Their Own Analogical Hallucinations? Evaluating Uncertainty Estimation for Analogical Reasoning ACL 2025

Superfluous Instruction: Vulnerabilities Stemming from Task-Specific Superficial Expressions in Instruction Templates ACL 2025

UTF: Under-trained Tokens as Fingerprints, a Novel Approach to LLM Identification ACL 2025

RedHit: Adaptive Red-Teaming of Large Language Models via Search, Reasoning, and Preference Optimization ACL 2025

Using Humor to Bypass Safety Guardrails in Large Language Models ACL 2025

LongSafety: Enhance Safety for Long-Context LLMs ACL 2025

ArithmAttack: Evaluating Robustness of LLMs to Noisy Context in Math Problem Solving ACL 2025

X-Guard: Multilingual Guard Agent for Content Moderation ACL 2025

1-2-3 Check: Enhancing Contextual Privacy in LLM via Multi-Agent Reasoning ACL 2025

Fine-Tuning Lowers Safety and Disrupts Evaluation Consistency ACL 2025