Research Explorer
AI Safety
2972 directly classified papers
Papers per year
2002: 1
2006: 1
2007: 1
2012: 4
2013: 1
2015: 5
2016: 1
2017: 13
2018: 40
2019: 91
2020: 111
2021: 181
2022: 204
2023: 333
2024: 642
2025: 1031
2026: 312
Papers
Line of Duty: Evaluating LLM Self-Knowledge via Consistency in Feasibility Boundaries
NAACL 2025
Hallucination Detox: Sensitivity Dropout (SenD) for Large Language Model Training
ACL 2025
Learn and Unlearn: Addressing Misinformation in Multilingual LLMs
EMNLP 2025
RED: Unleashing Token-Level Rewards from Holistic Feedback via Reward Redistribution
EMNLP 2025
Break the Breakout: Reinventing LM Defense Against Jailbreak Attacks with Self-Refine
NAACL 2025
Smaller Large Language Models Can Do Moral Self-Correction
NAACL 2025
Knowledge Boundary of Large Language Models: A Survey
ACL 2025
A Comprehensive Evaluation of Cognitive Biases in LLMs
NAACL 2025
A Comparative Analysis of Ethical and Safety Gaps in LLMs using Relative Danger Coefficient
NAACL 2025
Stepwise Reasoning Disruption Attack of LLMs
ACL 2025
MM-CRITIC: A Holistic Evaluation of Large Multimodal Models as Multimodal Critique
EMNLP 2025
CIC-NLP@DravidianLangTech 2025: Detecting AI-generated Product Reviews in Dravidian Languages
NAACL 2025
SSNTrio@DravidianLangTech 2025: Identification of AI Generated Content in Dravidian Languages using Transformers
NAACL 2025
Improve Safety Training of Large Language Models with Safety-Critical Singular Vectors Localization
ACL 2025
Tongue-Tied: Breaking LLMs Safety Through New Language Learning
NAACL 2025
Aligning to What? Limits to RLHF Based Alignment
NAACL 2025
SafeRAG: Benchmarking Security in Retrieval-Augmented Generation of Large Language Model
ACL 2025
MUSE: MCTS-Driven Red Teaming Framework for Enhanced Multi-Turn Dialogue Safety in Large Language Models
EMNLP 2025
Think in Safety: Unveiling and Mitigating Safety Alignment Collapse in Multimodal Large Reasoning Model
EMNLP 2025
R-TOFU: Unlearning in Large Reasoning Models
EMNLP 2025
Breaking ReAct Agents: Foot-in-the-Door Attack Will Get You In
NAACL 2025
A Practical Examination of AI-Generated Text Detectors for Large Language Models
NAACL 2025
Confidence v.s. Critique: A Decomposition of Self-Correction Capability for LLMs
ACL 2025
WordGame: Efficient & Effective LLM Jailbreak via Simultaneous Obfuscation in Query and Response
NAACL 2025
Improving Consistency in LLM Inference using Probabilistic Tokenization
NAACL 2025