Research Explorer
AI Safety
2972 directly classified papers
Papers per year
2002: 1
2006: 1
2007: 1
2012: 4
2013: 1
2015: 5
2016: 1
2017: 13
2018: 40
2019: 91
2020: 111
2021: 181
2022: 204
2023: 333
2024: 642
2025: 1031
2026: 312
Papers
Permitted Knowledge Boundary: Evaluating the Knowledge-Constrained Responsiveness of Large Language Models
EMNLP 2025
FC-Attack: Jailbreaking Multimodal Large Language Models via Auto-Generated Flowcharts
EMNLP 2025
SafeInt: Shielding Large Language Models from Jailbreak Attacks via Safety-Aware Representation Intervention
EMNLP 2025
Improving Alignment in LVLMs with Debiased Self-Judgment
EMNLP 2025
Distributional Surgery for Language Model Activations
EMNLP 2025
Hidden Ghost Hand: Unveiling Backdoor Vulnerabilities in MLLM-Powered Mobile GUI Agents
EMNLP 2025
Towards Reverse Engineering of Language Models: A Survey
EMNLP 2025
Invisible Prompts, Visible Threats: Malicious Font Injection in External Resources for Large Language Models
EMNLP 2025
SafeSwitch: Steering Unsafe LLM Behavior via Internal Activation Signals
EMNLP 2025
Is Safety Standard Same for Everyone? User-Specific Safety Evaluation of Large Language Models
EMNLP 2025
A Knapsack by Any Other Name: Presentation impacts LLM performance on NP-hard problems
EMNLP 2025
LLM Jailbreak Detection for (Almost) Free!
EMNLP 2025
X-Boundary: Establishing Exact Safety Boundary to Shield LLMs from Jailbreak Attacks without Compromising Usability
EMNLP 2025
Forewarned is Forearmed: Pre-Synthesizing Jailbreak-like Instructions to Enhance LLM Safety Guardrail to Potential Attacks
EMNLP 2025
Breaking the Reviewer: Assessing the Vulnerability of Large Language Models in Automated Peer Review Under Textual Adversarial Attacks
EMNLP 2025
How Does Cognitive Bias Affect Large Language Models? A Case Study on the Anchoring Effect in Price Negotiation Simulations
EMNLP 2025
PREE: Towards Harmless and Adaptive Fingerprint Editing in Large Language Models via Knowledge Prefix Enhancement
EMNLP 2025
Multilingual Collaborative Defense for Large Language Models
EMNLP 2025
PD3F: A Pluggable and Dynamic DoS-Defense Framework against resource consumption attacks targeting Large Language Models
EMNLP 2025
LNE-Blocking: An Efficient Framework for Contamination Mitigation Evaluation on Large Language Models
EMNLP 2025
SafeConf: A Confidence-Calibrated Safety Self-Evaluation Method for Large Language Models
EMNLP 2025
Not All Voices Are Rewarded Equally: Probing and Repairing Reward Models across Human Diversity
EMNLP 2025
Detecting Stealthy Backdoor Samples based on Intra-class Distance for Large Language Models
EMNLP 2025
PromptKeeper: Safeguarding System Prompts for LLMs
EMNLP 2025
ASTPrompter: Preference-Aligned Automated Language Model Red-Teaming to Generate Low-Perplexity Unsafe Prompts
EMNLP 2025