Research Explorer
AI Safety
2972 directly classified papers
Papers per year
2002: 1
2006: 1
2007: 1
2012: 4
2013: 1
2015: 5
2016: 1
2017: 13
2018: 40
2019: 91
2020: 111
2021: 181
2022: 204
2023: 333
2024: 642
2025: 1031
2026: 312
Papers
Unmasking Style Sensitivity: A Causal Analysis of Bias Evaluation Instability in Large Language Models
ACL 2025
Jailbreak Attack Initializations as Extractors of Compliance Directions
EMNLP 2025
SafeChain: Safety of Language Models with Long Chain-of-Thought Reasoning Capabilities
ACL 2025
CLARITY: Clinical Assistant for Routing, Inference, and Triage
EMNLP 2025
Understanding the Dark Side of LLMs’ Intrinsic Self-Correction
ACL 2025
Layer-Aware Representation Filtering: Purifying Finetuning Data to Preserve LLM Safety Alignment
EMNLP 2025
Beyond Reactive Safety: Risk-Aware LLM Alignment via Long-Horizon Simulation
ACL 2025
Scalable Surrogate Verification of Image-Based Neural Network Control Systems Using Composition and Unrolling
AAAI 2025
Contrasting Adversarial Perturbations: The Space of Harmless Perturbations
AAAI 2025
GIM: A Million-scale Benchmark for Generative Image Manipulation Detection and Localization
AAAI 2025
LongSafety: Enhance Safety for Long-Context LLMs
ACL 2025
Can LLMs Ground when they (Don’t) Know: A Study on Direct and Loaded Political Questions
ACL 2025
Using Humor to Bypass Safety Guardrails in Large Language Models
ACL 2025
RedHit: Adaptive Red-Teaming of Large Language Models via Search, Reasoning, and Preference Optimization
ACL 2025
HiddenDetect: Detecting Jailbreak Attacks against Multimodal Large Language Models via Monitoring Hidden States
ACL 2025
CRAFT: Class Ranking Aware Fine-Tuning for Enhanced Out-of-Distribution Detection
WACV 2025
The Dark Side of Function Calling: Pathways to Jailbreaking Large Language Models
COLING 2025
UTF: Under-trained Tokens as Fingerprints —— a Novel Approach to LLM Identification
ACL 2025
Superfluous Instruction: Vulnerabilities Stemming from Task-Specific Superficial Expressions in Instruction Templates
ACL 2025
Generative Psycho-Lexical Approach for Constructing Value Systems in Large Language Models
ACL 2025
Can LLMs Recognize Their Own Analogical Hallucinations? Evaluating Uncertainty Estimation for Analogical Reasoning
ACL 2025
Teaching Large Language Models to Express Knowledge Boundary from Their Own Signals
ACL 2025
Jailbreaking? One Step Is Enough!
ACL 2025
Pretend Benign: A Stealthy Adversarial Attack by Exploiting Vulnerabilities in Cooperative Perception
ICCV 2025
ELAB: Extensive LLM Alignment Benchmark in Persian Language
ACL 2025