Research Explorer
AI Safety
2972 directly classified papers
Papers per year
2002: 1
2006: 1
2007: 1
2012: 4
2013: 1
2015: 5
2016: 1
2017: 13
2018: 40
2019: 91
2020: 111
2021: 181
2022: 204
2023: 333
2024: 642
2025: 1031
2026: 312
Papers
Gradient Short-Circuit: Efficient Out-of-Distribution Detection via Feature Intervention
ICCV 2025
PLA: Prompt Learning Attack against Text-to-Image Generative Models
ICCV 2025
DCT-Shield: A Robust Frequency Domain Defense against Malicious Image Editing
ICCV 2025
Prototype Guided Backdoor Defense via Activation Space Manipulation
ICCV 2025
Meta-Unlearning on Diffusion Models: Preventing Relearning Unlearned Concepts
ICCV 2025
Human Bias in the Face of AI: Examining Human Judgment Against Text Labeled as AI Generated
ACL 2025
Blinded by Context: Unveiling the Halo Effect of MLLM in AI Hiring
ACL 2025
Enhancing Security and Strengthening Defenses in Automated Short-Answer Grading Systems
ACL 2025
PL-Guard: Benchmarking Language Model Safety for Polish
ACL 2025
The Law of Knowledge Overshadowing: Towards Understanding, Predicting, and Preventing LLM Hallucination
ACL 2025
Are Bias Evaluation Methods Biased?
ACL 2025
Cleanse: Uncertainty Estimation Approach Using Clustering-based Semantic Consistency in LLMs
ACL 2025
Judging the Judges: Evaluating Alignment and Vulnerabilities in LLMs-as-Judges
ACL 2025
ELAB: Extensive LLM Alignment Benchmark in Persian Language
ACL 2025
Teaching Large Language Models to Express Knowledge Boundary from Their Own Signals
ACL 2025
Can LLMs Recognize Their Own Analogical Hallucinations? Evaluating Uncertainty Estimation for Analogical Reasoning
ACL 2025
Superfluous Instruction: Vulnerabilities Stemming from Task-Specific Superficial Expressions in Instruction Templates
ACL 2025
UTF: Under-trained Tokens as Fingerprints - a Novel Approach to LLM Identification
ACL 2025
RedHit: Adaptive Red-Teaming of Large Language Models via Search, Reasoning, and Preference Optimization
ACL 2025
Using Humor to Bypass Safety Guardrails in Large Language Models
ACL 2025
LongSafety: Enhance Safety for Long-Context LLMs
ACL 2025
ArithmAttack: Evaluating Robustness of LLMs to Noisy Context in Math Problem Solving
ACL 2025
X-Guard: Multilingual Guard Agent for Content Moderation
ACL 2025
1-2-3 Check: Enhancing Contextual Privacy in LLM via Multi-Agent Reasoning
ACL 2025
Fine-Tuning Lowers Safety and Disrupts Evaluation Consistency
ACL 2025