Research Explorer
AI Safety
2972 directly classified papers
Papers per year
2002: 1
2006: 1
2007: 1
2012: 4
2013: 1
2015: 5
2016: 1
2017: 13
2018: 40
2019: 91
2020: 111
2021: 181
2022: 204
2023: 333
2024: 642
2025: 1031
2026: 312
Papers
LSSF: Safety Alignment for Large Language Models through Low-Rank Safety Subspace Fusion
ACL 2025
LongSafety: Evaluating Long-Context Safety of Large Language Models
ACL 2025
Unintended Harms of Value-Aligned LLMs: Psychological and Empirical Insights
ACL 2025
Adversarial Preference Learning for Robust LLM Alignment
ACL 2025
Are the Values of LLMs Structurally Aligned with Humans? A Causal Perspective
ACL 2025
Extended Abstract: Probing-Guided Parameter-Efficient Fine-Tuning for Balancing Linguistic Adaptation and Safety in LLM-based Social Influence Systems
ACL 2025
Detoxify-IT: An Italian Parallel Dataset for Text Detoxification
ACL 2025
Multilingual Text-to-Image Generation Magnifies Gender Stereotypes
ACL 2025
Caution for the Environment: Multimodal LLM Agents are Susceptible to Environmental Distractions
ACL 2025
Can GPTZero’s AI Vocabulary Distinguish Between LLM-Generated and Student-Written Essays?
ACL 2025
Efficient but Vulnerable: Benchmarking and Defending LLM Batch Prompting Attack
ACL 2025
Unmasking Style Sensitivity: A Causal Analysis of Bias Evaluation Instability in Large Language Models
ACL 2025
SafeChain: Safety of Language Models with Long Chain-of-Thought Reasoning Capabilities
ACL 2025
Understanding the Dark Side of LLMs’ Intrinsic Self-Correction
ACL 2025
Beyond Reactive Safety: Risk-Aware LLM Alignment via Long-Horizon Simulation
ACL 2025
Defending against Indirect Prompt Injection by Instruction Detection
EMNLP 2025
Localizing Malicious Outputs from CodeLLM
EMNLP 2025
MidPO: Dual Preference Optimization for Safety and Helpfulness in Large Language Models via a Mixture of Experts Framework
EMNLP 2025
TrapDoc: Deceiving LLM Users by Injecting Imperceptible Phantom Tokens into Documents
EMNLP 2025
One Shot Dominance: Knowledge Poisoning Attack on Retrieval-Augmented Generation Systems
EMNLP 2025
Can Large Language Models Identify Implicit Suicidal Ideation? An Empirical Evaluation
EMNLP 2025
SafeToolBench: Pioneering a Prospective Benchmark to Evaluating Tool Utilization Safety in LLMs
EMNLP 2025
Beyond Surface Alignment: Rebuilding LLMs Safety Mechanism via Probabilistically Ablating Refusal Direction
EMNLP 2025
Curse of Knowledge: Your Guidance and Provided Knowledge are biasing LLM Judges in Complex Evaluation
EMNLP 2025
From Remembering to Metacognition: Do Existing Benchmarks Accurately Evaluate LLMs?
EMNLP 2025