Research Explorer
AI Safety
2972 directly classified papers
Papers per year
2002: 1
2006: 1
2007: 1
2012: 4
2013: 1
2015: 5
2016: 1
2017: 13
2018: 40
2019: 91
2020: 111
2021: 181
2022: 204
2023: 333
2024: 642
2025: 1031
2026: 312
Papers
Guardians of Trust: Risks and Opportunities for LLMs in Mental Health
ACL 2025
What Counts Underlying LLMs’ Moral Dilemma Judgments?
ACL 2025
Inherent and emergent liability issues in LLM-based agentic systems: a principal-agent perspective
ACL 2025
Safe in Isolation, Dangerous Together: Agent-Driven Multi-Turn Decomposition Jailbreaks on LLMs
ACL 2025
iShumei-Chinchunmei at SemEval-2025 Task 4: A balanced forgetting and retention multi-task framework using effective unlearning loss
ACL 2025
AILS-NTUA at SemEval-2025 Task 4: Parameter-Efficient Unlearning for Large Language Models using Data Chunking
ACL 2025
SemEval-2025 Task 3: Mu-SHROOM, the Multilingual Shared-task on Hallucinations and Related Observable Overgeneration Mistakes
ACL 2025
SemEval-2025 Task 4: Unlearning sensitive content from Large Language Models
ACL 2025
PROTECT: Policy-Related Organizational Value Taxonomy for Ethical Compliance and Trust
ACL 2025
Can LLMs Rank the Harmfulness of Smaller LLMs? We are Not There Yet
ACL 2025
QGuard: Question-based Zero-shot Guard for Multi-modal LLM Safety
ACL 2025
Detecting Child Objectification on Social Media: Challenges in Language Modeling
ACL 2025
Graph of Attacks with Pruning: Optimizing Stealthy Jailbreak Prompt Generation for Enhanced LLM Content Moderation
ACL 2025
Red-Teaming for Uncovering Societal Bias in Large Language Models
ACL 2025
Scalable Surrogate Verification of Image-Based Neural Network Control Systems Using Composition and Unrolling
AAAI 2025
Contrasting Adversarial Perturbations: The Space of Harmless Perturbations
AAAI 2025
GIM: A Million-scale Benchmark for Generative Image Manipulation Detection and Localization
AAAI 2025
PROSAC: Provably Safe Certification for Machine Learning Models under Adversarial Attacks
AAAI 2025
AUTE: Peer-Alignment and Self-Unlearning Boost Adversarial Robustness for Training Ensemble Models
AAAI 2025
AIM: Additional Image Guided Generation of Transferable Adversarial Attacks
AAAI 2025
Training Verification-Friendly Neural Networks via Neuron Behavior Consistency
AAAI 2025
Efficient Robustness Evaluation via Constraint Relaxation
AAAI 2025
First Line of Defense: A Robust First Layer Mitigates Adversarial Attacks
AAAI 2025
ADBA: Approximation Decision Boundary Approach for Black-Box Adversarial Attacks
AAAI 2025
Meme Trojan: Backdoor Attacks Against Hateful Meme Detection via Cross-Modal Triggers
AAAI 2025