Safety

414 papers

Papers per year (chart; year labels not recovered): 1, 1, 4, 8, 11, 21, 29, 36, 87, 117, 99

Papers

SAME: Safety-Aware Model Editing Guided by Safety Transformation ACL 2026

Projecting Out the Malice: A Global Subspace Approach to LLM Detoxification ACL 2026

Reasoning Hijacking: The Fragility of Reasoning Alignment in Large Language Models ACL 2026

The Side Effects of Being Smart: Safety Risks in MLLMs’ Multi-Image Reasoning ACL 2026

AlignCultura: Towards Culturally Aligned Large Language Models? ACL 2026

Mitigating Safety Context Amnesia in Multimodal Reasoning Models via Intent-Guided Safety Reasoning ACL 2026

Please refuse to answer me! Mitigating Over-Refusal in Large Language Models via Adaptive Contrastive Decoding ACL 2026

Decoding-Unlearning: Fact Forgetting via Entropy-Guided Inference ACL 2026

Quantifying and Mitigating Socially Desirable Responding in LLMs: A Desirability-Matched Graded Forced-Choice Psychometric Study ACL 2026

CAP: Controllable Alignment Prompting for Unlearning in LLMs ACL 2026

Rendering Data Unlearnable by Exploiting LLM Alignment Mechanisms ACL 2026

Learning to Conceal Risk: Controllable Multi-turn Red Teaming for LLMs in the Financial Domain ACL 2026

Multimodal Safety Evaluation in Generative Agent Social Simulations ACL 2026

SafeMT: Multi-turn Safety for Multimodal Language Models ACL 2026

ARES: Adaptive Red-Teaming and End-to-End Repair of Policy-Reward System ACL 2026

Deep Research with Open-Domain Evaluation and Multi-Stage Guardrails for Safety ACL 2026

A Lightweight Explainable Guardrail for Prompt Safety ACL 2026

SHARP: Self-adaptive Harmful Category-aware Prompt Generation for Black-box Jailbreaking ACL 2026

COMPASS: A Framework for Evaluating Organization-Specific Policy Alignment in LLMs ACL 2026

Red-Bandit: Test-Time Adaptation for LLM Red-Teaming via Bandit-Guided LoRA Experts ACL 2026

Thesis Proposal: An Explainable Multimodal Framework for Detecting Harmful Content in Code-Switched Children’s Media ACL 2026

Knowledge Control for Responsible Generative AI: Bridging Academia, Industry, and Society ACL 2026

IPS: In-Prompt Process Supervision for Short Video Content Moderation ACL 2026

FinHarmBench: Financial Jailbreak Benchmark and Unsupervised Safety Fine-Tuning via Refusal Steering Distillation ACL 2026

Scalable Surrogate Verification of Image-Based Neural Network Control Systems Using Composition and Unrolling AAAI 2025