Research Explorer
AI Safety
2972 directly classified papers
Papers per year
2002: 1
2006: 1
2007: 1
2012: 4
2013: 1
2015: 5
2016: 1
2017: 13
2018: 40
2019: 91
2020: 111
2021: 181
2022: 204
2023: 333
2024: 642
2025: 1031
2026: 312
Papers
Why Safeguarded Ships Run Aground? Aligned Large Language Models’ Safety Mechanisms Tend to Be Anchored in The Template Region
ACL 2025
Cheems: A Practical Guidance for Building and Evaluating Chinese Reward Models from Scratch
ACL 2025
Towards a Theory of AI Personhood
AAAI 2025
CALM: Curiosity-Driven Auditing for Large Language Models
AAAI 2025
Certified Trustworthiness in the Era of Large Language Models
AAAI 2025
Combating Phone Scams with LLM-based Detection: Where Do We Stand? (Student Abstract)
AAAI 2025
Fine-Tuning Lowers Safety and Disrupts Evaluation Consistency
ACL 2025
TRIDENT: Enhancing Large Language Model Safety with Tri-Dimensional Diversified Red-Teaming Data Synthesis
ACL 2025
1-2-3 Check: Enhancing Contextual Privacy in LLM via Multi-Agent Reasoning
ACL 2025
X-Guard: Multilingual Guard Agent for Content Moderation
ACL 2025
Chinese SafetyQA: A Safety Short-form Factuality Benchmark for Large Language Models
ACL 2025
Data with High and Consistent Preference Difference Are Better for Reward Model
AAAI 2025
ArithmAttack: Evaluating Robustness of LLMs to Noisy Context in Math Problem Solving
ACL 2025
Guardians of Trust: Risks and Opportunities for LLMs in Mental Health
ACL 2025
What Counts Underlying LLMs’ Moral Dilemma Judgments?
ACL 2025
Inherent and emergent liability issues in LLM-based agentic systems: a principal-agent perspective
ACL 2025
Safe in Isolation, Dangerous Together: Agent-Driven Multi-Turn Decomposition Jailbreaks on LLMs
ACL 2025
iShumei-Chinchunmei at SemEval-2025 Task 4: A balanced forgetting and retention multi-task framework using effective unlearning loss
ACL 2025
AILS-NTUA at SemEval-2025 Task 4: Parameter-Efficient Unlearning for Large Language Models using Data Chunking
ACL 2025
SemEval-2025 Task 3: Mu-SHROOM, the Multilingual Shared-task on Hallucinations and Related Observable Overgeneration Mistakes
ACL 2025
SemEval-2025 Task 4: Unlearning sensitive content from Large Language Models
ACL 2025
PROTECT: Policy-Related Organizational Value Taxonomy for Ethical Compliance and Trust
ACL 2025
Can LLMs Rank the Harmfulness of Smaller LLMs? We are Not There Yet
ACL 2025
QGuard: Question-based Zero-shot Guard for Multi-modal LLM Safety
ACL 2025
LongSafety: Enhance Safety for Long-Context LLMs
ACL 2025