Research Explorer
AI Safety
2972 directly classified papers
Papers per year
2002: 1
2006: 1
2007: 1
2012: 4
2013: 1
2015: 5
2016: 1
2017: 13
2018: 40
2019: 91
2020: 111
2021: 181
2022: 204
2023: 333
2024: 642
2025: 1031
2026: 312
Papers
Hermit Kingdom Through the Lens of Multiple Perspectives: A Case Study of LLM Hallucination on North Korea
COLING 2025
Jailbreak LLMs through Internal Stance Manipulation
EMNLP 2025
CONTRANS: Weak-to-Strong Alignment Engineering via Concept Transplantation
COLING 2025
Exploiting the Index Gradients for Optimization-Based Jailbreaking on Large Language Models
COLING 2025
What’s the most important value? INVP: INvestigating the Value Priorities of LLMs through Decision-making in Social Scenarios
COLING 2025
PricingLogic: Evaluating LLMs Reasoning on Complex Tourism Pricing Tasks
EMNLP 2025
MM-CRITIC: A Holistic Evaluation of Large Multimodal Models as Multimodal Critique
EMNLP 2025
HAF-RM: A Hybrid Alignment Framework for Reward Model Training
ACL 2025
Towards Statistical Factuality Guarantee for Large Vision-Language Models
EMNLP 2025
FinTrust: A Comprehensive Benchmark of Trustworthiness Evaluation in Finance Domain
EMNLP 2025
CoBia: Constructed Conversations Can Trigger Otherwise Concealed Societal Biases in LLMs
EMNLP 2025
SConU: Selective Conformal Uncertainty in Large Language Models
ACL 2025
SafeScientist: Enhancing AI Scientist Safety for Risk-Aware Scientific Discovery
EMNLP 2025
Large Language Models Have Intrinsic Meta-Cognition, but Need a Good Lens
EMNLP 2025
Stand on The Shoulders of Giants: Building JailExpert from Previous Attack Experience
EMNLP 2025
SeqAR: Jailbreak LLMs with Sequential Auto-Generated Characters
NAACL 2025
Decoding Hate: Exploring Language Models’ Reactions to Hate Speech
NAACL 2025
SafetyQuizzer: Timely and Dynamic Evaluation on the Safety of LLMs
NAACL 2025
Stealthy Jailbreak Attacks on Large Language Models via Benign Data Mirroring
NAACL 2025
From Allies to Adversaries: Manipulating LLM Tool-Calling through Adversarial Injection
NAACL 2025
Have LLMs Reopened the Pandora’s Box of AI-Generated Fake News?
NAACL 2025
Extracting and Understanding the Superficial Knowledge in Alignment
NAACL 2025
Agentic Reward Modeling: Integrating Human Preferences with Verifiable Correctness Signals for Reliable Reward Systems
ACL 2025
Dynamic Evaluation with Cognitive Reasoning for Multi-turn Safety of Large Language Models
ACL 2025
HalLoc: Token-level Localization of Hallucinations for Vision Language Models
CVPR 2025