Research Explorer
Papers
Trends
Conferences
Explore
Authors
Topics
Keywords
Achievements
About
Methodology
AI Safety
2972 directly classified papers
Papers per year
2002: 1
2006: 1
2007: 1
2012: 4
2013: 1
2015: 5
2016: 1
2017: 13
2018: 40
2019: 91
2020: 111
2021: 181
2022: 204
2023: 333
2024: 642
2025: 1031
2026: 312
Papers
MULBERE: Multilingual Jailbreak Robustness Using Targeted Latent Adversarial Training
EMNLP 2025
Large Language Models as Detectors or Instigators of Hate Speech in Low-resource Ethiopian Languages
EMNLP 2025
Multilingual Text-to-Image Generation Magnifies Gender Stereotypes
ACL 2025
Detoxify-IT: An Italian Parallel Dataset for Text Detoxification
ACL 2025
Confidence-Based Response Abstinence: Improving LLM Trustworthiness via Activation-Based Uncertainty Estimation
EMNLP 2025
Extended Abstract: Probing-Guided Parameter-Efficient Fine-Tuning for Balancing Linguistic Adaptation and Safety in LLM-based Social Influence Systems
ACL 2025
Are the Values of LLMs Structurally Aligned with Humans? A Causal Perspective
ACL 2025
On the Role of Unobserved Sequences on Sample-based Uncertainty Quantification for LLMs
EMNLP 2025
Towards Trustworthy Summarization of Cardiovascular Articles: A Factuality-and-Uncertainty-Aware Biomedical LLM Approach
EMNLP 2025
Adversarial Preference Learning for Robust LLM Alignment
ACL 2025
Unintended Harms of Value-Aligned LLMs: Psychological and Empirical Insights
ACL 2025
Model Rake: A Defense Against Stealing Attacks in Split Learning
IJCAI 2025
LongSafety: Evaluating Long-Context Safety of Large Language Models
ACL 2025
LSSF: Safety Alignment for Large Language Models through Low-Rank Safety Subspace Fusion
ACL 2025
Neuron Similarity-Based Neural Network Verification via Abstraction and Refinement
IJCAI 2025
Human Bias in the Face of AI: Examining Human Judgment Against Text Labeled as AI Generated
ACL 2025
Human-AI Moral Judgment Congruence on Real-World Scenarios: A Cross-Lingual Analysis
EMNLP 2025
No for Some, Yes for Others: Persona Prompts and Other Sources of False Refusal in Language Models
EMNLP 2025
PIGuard: Prompt Injection Guardrail via Mitigating Overdefense for Free
ACL 2025
MEraser: An Effective Fingerprint Erasure Approach for Large Language Models
ACL 2025
Formal Synthesis of Safe Kolmogorov-Arnold Network Controllers with Barrier Certificates
IJCAI 2025
A Theory of Response Sampling in LLMs: Part Descriptive and Part Prescriptive
ACL 2025
SDD: Self-Degraded Defense against Malicious Fine-tuning
ACL 2025
Beyond Clean Training Data: A Versatile and Model-Agnostic Framework for Out-of-Distribution Detection with Contaminated Training Data
CVPR 2025
Jailbreaking the Non-Transferable Barrier via Test-Time Data Disguising
CVPR 2025