conftrace
Safety
414 papers
Papers per year
2016: 1
2017: 1
2018: 4
2019: 8
2020: 11
2021: 21
2022: 29
2023: 36
2024: 87
2025: 117
2026: 99
Papers
Red Teaming Large Reasoning Models
ACL 2026
Detoxification for LLM: From Dataset Itself
ACL 2026
Safeguarding LLM Fine-tuning via Push-Pull Distributional Alignment
ACL 2026
SafetyMem: Adaptive Jailbreak Defense via Dual-Component Safety Memory
ACL 2026
Two Pathways to Truthfulness: On the Intrinsic Encoding of LLM Hallucinations
ACL 2026
Forget What Matters, Keep the Rest: Selective Unlearning of Informative Tokens
ACL 2026
CrossGuard: Safeguarding MLLMs against Joint-Modal Implicit Malicious Attacks
ACL 2026
Pruning Unsafe Tickets: A Resource-Efficient Framework for Safer and More Robust LLMs
ACL 2026
TriPlay-RL: Tri-Role Self-Play Reinforcement Learning for LLM Safety Alignment
ACL 2026
Inertia in Moral and Value Judgments of Large Language Models
ACL 2026
OASIS: Mitigating Harmful Fine-tuning Attacks on LLMs via Orthogonal and Adaptive Safety Alignment Strategy
ACL 2026
N-GLARE: An Non-Generative Latent Representation-Efficient LLM Safety Evaluator
ACL 2026
Detecting What Queries Seek: Steering LLM Safety with FFN Output Activation Monitoring
ACL 2026
Explainable and Fine-Grained Safeguarding of LLM Multi-Agent Systems via Bi-Level Graph Anomaly Detection
ACL 2026
Evaluating Answer Leakage Robustness of LLM Tutors against Adversarial Student Attacks
ACL 2026
TAMAS: Benchmarking Adversarial Risks in Multi-Agent LLM Systems
ACL 2026
How Controllable Are Large Language Models? A Unified Evaluation across Behavioral Granularities
ACL 2026
HiddenGuard: Fine-Grained Safe Generation with Specialized Representation Router
ACL 2026
Backdoors in RLVR: Jailbreak Backdoors in LLMs From Verifiable Reward
ACL 2026
EthicMind: A Risk-Aware Framework for Ethical-Emotional Alignment in Multi-Turn Dialogue
ACL 2026
Resolving the Security-Auditability Dilemma with Auditable Latent Chain-of-Thought Alignment
ACL 2026
More Thinking, Less Talking: Internalizing Deliberative Safety into LLM Parameters
ACL 2026
Confident, Calibrated, or Complicit: Safety Alignment and Ideological Bias in LLM Hate Speech Detection
ACL 2026
Can LLM Safety Be Ensured by Constraining Parameter Regions?
ACL 2026
Reinforcement Learning–Guided Adaptive Tuning for Out-of-Distribution Harmful Text Detection
ACL 2026