Research Explorer
AI Safety
2972 directly classified papers
Papers per year
2002: 1
2006: 1
2007: 1
2012: 4
2013: 1
2015: 5
2016: 1
2017: 13
2018: 40
2019: 91
2020: 111
2021: 181
2022: 204
2023: 333
2024: 642
2025: 1031
2026: 312
Papers
MEraser: An Effective Fingerprint Erasure Approach for Large Language Models
ACL 2025
Harry Potter is Still Here! Probing Knowledge Leakage in Targeted Unlearned Large Language Models
EMNLP 2025
Formal Synthesis of Safe Kolmogorov-Arnold Network Controllers with Barrier Certificates
IJCAI 2025
Stealthy Backdoor Attack in Self-Supervised Learning Vision Encoders for Large Vision Language Models
CVPR 2025
PIGuard: Prompt Injection Guardrail via Mitigating Overdefense for Free
ACL 2025
ADU: Adaptive Detection of Unknown Categories in Black-Box Domain Adaptation
CVPR 2025
Towards Effective and Sparse Adversarial Attack on Spiking Neural Networks via Breaking Invisible Surrogate Gradients
CVPR 2025
Exploring Visual Vulnerabilities via Multi-Loss Adversarial Search for Jailbreaking Vision-Language Models
CVPR 2025
LSSF: Safety Alignment for Large Language Models through Low-Rank Safety Subspace Fusion
ACL 2025
Your Semantic-Independent Watermark is Fragile: A Semantic Perturbation Attack against EaaS Watermark
EMNLP 2025
Neuron Similarity-Based Neural Network Verification via Abstraction and Refinement
IJCAI 2025
RuleR: Improving LLM Controllability by Rule-based Data Recycling
NAACL 2025
Can Knowledge Graphs Make Large Language Models More Trustworthy? An Empirical Study Over Open-ended Question Answering
ACL 2025
LongSafety: Evaluating Long-Context Safety of Large Language Models
ACL 2025
JailbreakRadar: Comprehensive Assessment of Jailbreak Attacks Against LLMs
ACL 2025
MuMA-ToM: Multi-modal Multi-Agent Theory of Mind
AAAI 2025
Unintended Harms of Value-Aligned LLMs: Psychological and Empirical Insights
ACL 2025
ProcessBench: Identifying Process Errors in Mathematical Reasoning
ACL 2025
Jailbreak Large Vision-Language Models Through Multi-Modal Linkage
ACL 2025
Oversight Structures for Agentic AI in Public-Sector Organizations
ACL 2025
Chain-of-Jailbreak Attack for Image Generation Models via Step by Step Editing
ACL 2025
Agentic Reward Modeling: Integrating Human Preferences with Verifiable Correctness Signals for Reliable Reward Systems
ACL 2025
ELBA-Bench: An Efficient Learning Backdoor Attacks Benchmark for Large Language Models
ACL 2025
Improved Unbiased Watermark for Large Language Models
ACL 2025
DIESEL: A Lightweight Inference-Time Safety Enhancement for Language Models
ACL 2025