Research Explorer
AI Safety
2972 directly classified papers
Papers per year
2002: 1
2006: 1
2007: 1
2012: 4
2013: 1
2015: 5
2016: 1
2017: 13
2018: 40
2019: 91
2020: 111
2021: 181
2022: 204
2023: 333
2024: 642
2025: 1031
2026: 312
Papers
Learning to Rewrite: Generalized LLM-Generated Text Detection
ACL 2025
Biased LLMs can Influence Political Decision-Making
ACL 2025
LLM as a Broken Telephone: Iterative Generation Distorts Information
ACL 2025
AGrail: A Lifelong Agent Guardrail with Effective and Adaptive Safety Detection
ACL 2025
VLSBench: Unveiling Visual Leakage in Multimodal Safety
ACL 2025
Exploiting the Shadows: Unveiling Privacy Leaks through Lower-Ranked Tokens in Large Language Models
ACL 2025
PIG: Privacy Jailbreak Attack on LLMs via Gradient-based Iterative In-Context Optimization
ACL 2025
InvestAlign: Overcoming Data Scarcity in Aligning Large Language Models with Investor Decision-Making Processes Under Herd Behavior
ACL 2025
PrivacyRestore: Privacy-Preserving Inference in Large Language Models via Privacy Removal and Restoration
ACL 2025
Beware of Your Po! Measuring and Mitigating AI Safety Risks in Role-Play Fine-Tuning of LLMs
ACL 2025
Improving Factuality with Explicit Working Memory
ACL 2025
GuideBench: Benchmarking Domain-Oriented Guideline Following for LLM Agents
ACL 2025
Jailbreaking? One Step Is Enough!
ACL 2025
Generative Psycho-Lexical Approach for Constructing Value Systems in Large Language Models
ACL 2025
HiddenDetect: Detecting Jailbreak Attacks against Multimodal Large Language Models via Monitoring Hidden States
ACL 2025
Can LLMs Ground when they (Don’t) Know: A Study on Direct and Loaded Political Questions
ACL 2025
Chinese SafetyQA: A Safety Short-form Factuality Benchmark for Large Language Models
ACL 2025
TRIDENT: Enhancing Large Language Model Safety with Tri-Dimensional Diversified Red-Teaming Data Synthesis
ACL 2025
Cheems: A Practical Guidance for Building and Evaluating Chinese Reward Models from Scratch
ACL 2025
Why Safeguarded Ships Run Aground? Aligned Large Language Models’ Safety Mechanisms Tend to Be Anchored in The Template Region
ACL 2025
How to Mitigate Overfitting in Weak-to-strong Generalization?
ACL 2025
M2S: Multi-turn to Single-turn jailbreak in Red Teaming for LLMs
ACL 2025
Sheep’s Skin, Wolf’s Deeds: Are LLMs Ready for Metaphorical Implicit Hate Speech?
ACL 2025
Do not Abstain! Identify and Solve the Uncertainty
ACL 2025
Can You Really Trust Code Copilot? Evaluating Large Language Models from a Code Security Perspective
ACL 2025