Research Explorer
AI Safety
2972 directly classified papers
Papers per year
2002: 1
2006: 1
2007: 1
2012: 4
2013: 1
2015: 5
2016: 1
2017: 13
2018: 40
2019: 91
2020: 111
2021: 181
2022: 204
2023: 333
2024: 642
2025: 1031
2026: 312
Papers
Towards Trustworthy Summarization of Cardiovascular Articles: A Factuality-and-Uncertainty-Aware Biomedical LLM Approach
EMNLP 2025
Human-AI Moral Judgment Congruence on Real-World Scenarios: A Cross-Lingual Analysis
EMNLP 2025
MULBERE: Multilingual Jailbreak Robustness Using Targeted Latent Adversarial Training
EMNLP 2025
Investigating Motivated Inference in Large Language Models
EMNLP 2025
Large Language Models as Detectors or Instigators of Hate Speech in Low-resource Ethiopian Languages
EMNLP 2025
No for Some, Yes for Others: Persona Prompts and Other Sources of False Refusal in Language Models
EMNLP 2025
FLUE: Streamlined Uncertainty Estimation for Large Language Models
AAAI 2025
AIA: Autoregression-Based Injection Attacks Against Text2SQL Models
AAAI 2025
Forget to Flourish: Leveraging Machine-Unlearning on Pretrained Language Models for Privacy Leakage
AAAI 2025
SCANS: Mitigating the Exaggerated Safety for LLMs via Safety-Conscious Activation Steering
AAAI 2025
Against All Odds: Overcoming Typology, Script, and Language Confusion in Multilingual Embedding Inversion Attacks
AAAI 2025
Attributive Reasoning for Hallucination Diagnosis of Large Language Models
AAAI 2025
Security Attacks on LLM-based Code Completion Tools
AAAI 2025
SEAS: Self-Evolving Adversarial Safety Optimization for Large Language Models
AAAI 2025
Multi-Turn Jailbreaking Large Language Models via Attention Shifting
AAAI 2025
Investigating the Security Threat Arising from “Yes-No” Implicit Bias in Large Language Models
AAAI 2025
FigStep: Jailbreaking Large Vision-Language Models via Typographic Visual Prompts
AAAI 2025
Backdoor Token Unlearning: Exposing and Defending Backdoors in Pretrained Language Models
AAAI 2025
Task-Agnostic Language Model Watermarking via High Entropy Passthrough Layers
AAAI 2025
Tuning-Free Accountable Intervention for LLM Deployment – a Metacognitive Approach
AAAI 2025
Decoupling Metacognition from Cognition: A Framework for Quantifying Metacognitive Ability in LLMs
AAAI 2025
STLC-KG: A Social Text Steganalysis Method Combining Large-Scale Language Models and Common-Sense Knowledge Graphs
AAAI 2025
Mitigating Social Bias in Large Language Models: A Multi-Objective Approach Within a Multi-Agent Framework
AAAI 2025
NLSR: Neuron-Level Safety Realignment of Large Language Models Against Harmful Fine-Tuning
AAAI 2025
Beyond the Safety Bundle: Auditing the Helpful and Harmless Dataset
NAACL 2025