Research Explorer
AI Safety
2972 directly classified papers
Papers per year
2002: 1
2006: 1
2007: 1
2012: 4
2013: 1
2015: 5
2016: 1
2017: 13
2018: 40
2019: 91
2020: 111
2021: 181
2022: 204
2023: 333
2024: 642
2025: 1031
2026: 312
Papers
Granite Guardian: Comprehensive LLM Safeguarding
NAACL 2025
Improving Consistency in LLM Inference using Probabilistic Tokenization
NAACL 2025
Comparing Moral Values in Western English-speaking societies and LLMs with Word Associations
ACL 2025
Certified Human Trajectory Prediction
CVPR 2025
CADRef: Robust Out-of-Distribution Detection via Class-Aware Decoupled Relative Feature Leveraging
CVPR 2025
Training Data Provenance Verification: Did Your Model Use Synthetic Data from My Generative Model for Training?
CVPR 2025
A Practical Examination of AI-Generated Text Detectors for Large Language Models
NAACL 2025
Breaking ReAct Agents: Foot-in-the-Door Attack Will Get You In
NAACL 2025
Investigating the Security Threat Arising from “Yes-No” Implicit Bias in Large Language Models
AAAI 2025
Aligning to What? Limits to RLHF Based Alignment
NAACL 2025
SafeRAG: Benchmarking Security in Retrieval-Augmented Generation of Large Language Model
ACL 2025
Tongue-Tied: Breaking LLMs Safety Through New Language Learning
NAACL 2025
Enhancing Transferability of Targeted Adversarial Examples via Inverse Target Gradient Competition and Spatial Distance Stretching
ICCV 2025
Mind the Trojan Horse: Image Prompt Adapter Enabling Scalable and Deceptive Jailbreaking
CVPR 2025
Black-Box Forgery Attacks on Semantic Watermarks for Diffusion Models
CVPR 2025
Decoder Gradient Shield: Provable and High-Fidelity Prevention of Gradient-Based Box-Free Watermark Removal
CVPR 2025
CIC-NLP@DravidianLangTech 2025: Detecting AI-generated Product Reviews in Dravidian Languages
NAACL 2025
A Comparative Analysis of Ethical and Safety Gaps in LLMs using Relative Danger Coefficient
NAACL 2025
Stepwise Reasoning Disruption Attack of LLMs
ACL 2025
Beyond Clean Training Data: A Versatile and Model-Agnostic Framework for Out-of-Distribution Detection with Contaminated Training Data
CVPR 2025
Jailbreaking the Non-Transferable Barrier via Test-Time Data Disguising
CVPR 2025
Beyond Human Judgment: A Bayesian Evaluation of LLMs’ Moral Values Understanding
EMNLP 2025
Demystify Verbosity Compensation Behavior of Large Language Models
EMNLP 2025
On the Role of Unobserved Sequences on Sample-based Uncertainty Quantification for LLMs
EMNLP 2025
Iterative Self-Tuning LLMs for Enhanced Jailbreaking Capabilities
NAACL 2025