conftrace_

Safety

414 papers

Papers per year (chart; year labels not recoverable): 1, 1, 4, 8, 11, 21, 29, 36, 87, 117, 99

Papers

Red Teaming Large Reasoning Models ACL 2026

Detoxification for LLM: From Dataset Itself ACL 2026

Safeguarding LLM Fine-tuning via Push-Pull Distributional Alignment ACL 2026

SafetyMem: Adaptive Jailbreak Defense via Dual-Component Safety Memory ACL 2026

Two Pathways to Truthfulness: On the Intrinsic Encoding of LLM Hallucinations ACL 2026

Forget What Matters, Keep the Rest: Selective Unlearning of Informative Tokens ACL 2026

CrossGuard: Safeguarding MLLMs against Joint-Modal Implicit Malicious Attacks ACL 2026

Pruning Unsafe Tickets: A Resource-Efficient Framework for Safer and More Robust LLMs ACL 2026

TriPlay-RL: Tri-Role Self-Play Reinforcement Learning for LLM Safety Alignment ACL 2026

Inertia in Moral and Value Judgments of Large Language Models ACL 2026

OASIS: Mitigating Harmful Fine-tuning Attacks on LLMs via Orthogonal and Adaptive Safety Alignment Strategy ACL 2026

N-GLARE: An Non-Generative Latent Representation-Efficient LLM Safety Evaluator ACL 2026

Detecting What Queries Seek: Steering LLM Safety with FFN Output Activation Monitoring ACL 2026

Explainable and Fine-Grained Safeguarding of LLM Multi-Agent Systems via Bi-Level Graph Anomaly Detection ACL 2026

Evaluating Answer Leakage Robustness of LLM Tutors against Adversarial Student Attacks ACL 2026

TAMAS: Benchmarking Adversarial Risks in Multi-Agent LLM Systems ACL 2026

How Controllable Are Large Language Models? A Unified Evaluation across Behavioral Granularities ACL 2026

HiddenGuard: Fine-Grained Safe Generation with Specialized Representation Router ACL 2026

Backdoors in RLVR: Jailbreak Backdoors in LLMs From Verifiable Reward ACL 2026

EthicMind: A Risk-Aware Framework for Ethical-Emotional Alignment in Multi-Turn Dialogue ACL 2026

Resolving the Security-Auditability Dilemma with Auditable Latent Chain-of-Thought Alignment ACL 2026

More Thinking, Less Talking: Internalizing Deliberative Safety into LLM Parameters ACL 2026

Confident, Calibrated, or Complicit: Safety Alignment and Ideological Bias in LLM Hate Speech Detection ACL 2026

Can LLM Safety Be Ensured by Constraining Parameter Regions? ACL 2026

Reinforcement Learning–Guided Adaptive Tuning for Out-of-Distribution Harmful Text Detection ACL 2026