Research Explorer
AI Safety
2972 directly classified papers
Papers per year
2002: 1
2006: 1
2007: 1
2012: 4
2013: 1
2015: 5
2016: 1
2017: 13
2018: 40
2019: 91
2020: 111
2021: 181
2022: 204
2023: 333
2024: 642
2025: 1031
2026: 312
Papers
Comparing Moral Values in Western English-speaking societies and LLMs with Word Associations
ACL 2025
Tightening Robustness Verification of MaxPool-based Neural Networks via Minimizing the Over-Approximation Zone
CVPR 2025
Vulnerability of Large Language Models to Output Prefix Jailbreaks: Impact of Positions on Safety
NAACL 2025
Large Language Models are Easily Confused: A Quantitative Metric, Security Implications and Typological Analysis
NAACL 2025
Refuse Whenever You Feel Unsafe: Improving Safety in LLMs via Decoupled Refusal Training
ACL 2025
Safe Inputs but Unsafe Output: Benchmarking Cross-modality Safety Alignment of Large Vision-Language Models
NAACL 2025
Multilingual Blending: Large Language Model Safety Alignment Evaluation with Language Mixture
NAACL 2025
Cultural Learning-Based Culture Adaptation of Language Models
ACL 2025
Investigating Motivated Inference in Large Language Models
EMNLP 2025
Test-Time Backdoor Detection for Object Detection Models
CVPR 2025
Challenges in Trustworthy Human Evaluation of Chatbots
NAACL 2025
Atoxia: Red-teaming Large Language Models with Target Toxic Answers
NAACL 2025
Mixture of insighTful Experts (MoTE): The Synergy of Reasoning Chains and Expert Mixtures in Self-Alignment
ACL 2025
Towards Trustworthy Summarization of Cardiovascular Articles: A Factuality-and-Uncertainty-Aware Biomedical LLM Approach
EMNLP 2025
Con Instruction: Universal Jailbreaking of Multimodal Large Language Models via Non-Textual Modalities
ACL 2025
Human-AI Moral Judgment Congruence on Real-World Scenarios: A Cross-Lingual Analysis
EMNLP 2025
SMLE: Safe Machine Learning via Embedded Overapproximation
AAAI 2025
RESF: Regularized-Entropy-Sensitive Fingerprinting for Black-Box Tamper Detection of Large Language Models
EMNLP 2025
Ensemble Watermarks for Large Language Models
ACL 2025
Are Stereotypes Leading LLMs’ Zero-Shot Stance Detection?
EMNLP 2025
Confidence-Based Response Abstinence: Improving LLM Trustworthiness via Activation-Based Uncertainty Estimation
EMNLP 2025
uMedSum: A Unified Framework for Clinical Abstractive Summarization
ACL 2025
MULBERE: Multilingual Jailbreak Robustness Using Targeted Latent Adversarial Training
EMNLP 2025
Large Language Models as Detectors or Instigators of Hate Speech in Low-resource Ethiopian Languages
EMNLP 2025
RLAIF-V: Open-Source AI Feedback Leads to Super GPT-4V Trustworthiness
CVPR 2025