conftrace
Safety
414 papers
Papers per year
2016: 1
2017: 1
2018: 4
2019: 8
2020: 11
2021: 21
2022: 29
2023: 36
2024: 87
2025: 117
2026: 99
Papers
SAME: Safety-Aware Model Editing Guided by Safety Transformation
ACL 2026
Projecting Out the Malice: A Global Subspace Approach to LLM Detoxification
ACL 2026
Reasoning Hijacking: The Fragility of Reasoning Alignment in Large Language Models
ACL 2026
The Side Effects of Being Smart: Safety Risks in MLLMs’ Multi-Image Reasoning
ACL 2026
AlignCultura: Towards Culturally Aligned Large Language Models?
ACL 2026
Mitigating Safety Context Amnesia in Multimodal Reasoning Models via Intent-Guided Safety Reasoning
ACL 2026
Please refuse to answer me! Mitigating Over-Refusal in Large Language Models via Adaptive Contrastive Decoding
ACL 2026
Decoding-Unlearning: Fact Forgetting via Entropy-Guided Inference
ACL 2026
Quantifying and Mitigating Socially Desirable Responding in LLMs: A Desirability-Matched Graded Forced-Choice Psychometric Study
ACL 2026
CAP: Controllable Alignment Prompting for Unlearning in LLMs
ACL 2026
Rendering Data Unlearnable by Exploiting LLM Alignment Mechanisms
ACL 2026
Learning to Conceal Risk: Controllable Multi-turn Red Teaming for LLMs in the Financial Domain
ACL 2026
Multimodal Safety Evaluation in Generative Agent Social Simulations
ACL 2026
SafeMT: Multi-turn Safety for Multimodal Language Models
ACL 2026
ARES: Adaptive Red-Teaming and End-to-End Repair of Policy-Reward System
ACL 2026
Deep Research with Open-Domain Evaluation and Multi-Stage Guardrails for Safety
ACL 2026
A Lightweight Explainable Guardrail for Prompt Safety
ACL 2026
SHARP: Self-adaptive Harmful Category-aware Prompt Generation for Black-box Jailbreaking
ACL 2026
COMPASS: A Framework for Evaluating Organization-Specific Policy Alignment in LLMs
ACL 2026
Red-Bandit: Test-Time Adaptation for LLM Red-Teaming via Bandit-Guided LoRA Experts
ACL 2026
Thesis Proposal: An Explainable Multimodal Framework for Detecting Harmful Content in Code-Switched Children’s Media
ACL 2026
Knowledge Control for Responsible Generative AI: Bridging Academia, Industry, and Society
ACL 2026
IPS: In-Prompt Process Supervision for Short Video Content Moderation
ACL 2026
FinHarmBench: Financial Jailbreak Benchmark and Unsupervised Safety Fine-Tuning via Refusal Steering Distillation
ACL 2026
Scalable Surrogate Verification of Image-Based Neural Network Control Systems Using Composition and Unrolling
AAAI 2025