AI Safety

2972 directly classified papers

Papers

Judging the Judges: Evaluating Alignment and Vulnerabilities in LLMs-as-Judges ACL 2025

GuideBench: Benchmarking Domain-Oriented Guideline Following for LLM Agents ACL 2025

Cleanse: Uncertainty Estimation Approach Using Clustering-based Semantic Consistency in LLMs ACL 2025

Are Bias Evaluation Methods Biased? ACL 2025

Improving Factuality with Explicit Working Memory ACL 2025

Language Models Resist Alignment: Evidence From Data Compression ACL 2025

SEA: Low-Resource Safety Alignment for Multimodal Large Language Models via Synthetic Embeddings ACL 2025

Removing Prompt-template Bias in Reinforcement Learning from Human Feedback ACL 2025

The Law of Knowledge Overshadowing: Towards Understanding, Predicting, and Preventing LLM Hallucination ACL 2025

PL-Guard: Benchmarking Language Model Safety for Polish ACL 2025

Beware of Your Po! Measuring and Mitigating AI Safety Risks in Role-Play Fine-Tuning of LLMs ACL 2025

Enhancing Security and Strengthening Defenses in Automated Short-Answer Grading Systems ACL 2025

Blinded by Context: Unveiling the Halo Effect of MLLM in AI Hiring ACL 2025

PrivacyRestore: Privacy-Preserving Inference in Large Language Models via Privacy Removal and Restoration ACL 2025

ELBA-Bench: An Efficient Learning Backdoor Attacks Benchmark for Large Language Models ACL 2025

Human Bias in the Face of AI: Examining Human Judgment Against Text Labeled as AI Generated ACL 2025

Meta-Unlearning on Diffusion Models: Preventing Relearning Unlearned Concepts ICCV 2025

InvestAlign: Overcoming Data Scarcity in Aligning Large Language Models with Investor Decision-Making Processes Under Herd Behavior ACL 2025

Prototype Guided Backdoor Defense via Activation Space Manipulation ICCV 2025

DCT-Shield: A Robust Frequency Domain Defense against Malicious Image Editing ICCV 2025

PIG: Privacy Jailbreak Attack on LLMs via Gradient-based Iterative In-Context Optimization ACL 2025

FLUE: Streamlined Uncertainty Estimation for Large Language Models AAAI 2025

Improved Unbiased Watermark for Large Language Models ACL 2025

PLA: Prompt Learning Attack against Text-to-Image Generative Models ICCV 2025

Gradient Short-Circuit: Efficient Out-of-Distribution Detection via Feature Intervention ICCV 2025