Research Explorer
AI Safety
2972 directly classified papers
Papers per year
2002: 1
2006: 1
2007: 1
2012: 4
2013: 1
2015: 5
2016: 1
2017: 13
2018: 40
2019: 91
2020: 111
2021: 181
2022: 204
2023: 333
2024: 642
2025: 1031
2026: 312
Papers
Token-Aware Editing of Internal Activations for Large Language Model Alignment
EMNLP 2025
Reimagining Safety Alignment with An Image
EMNLP 2025
Visual Contextual Attack: Jailbreaking MLLMs with Image-Driven Context Injection
EMNLP 2025
Reshaping Representation Space to Balance the Safety and Over-rejection in Large Audio Language Models
EMNLP 2025
Towards Holistic Evaluation of Large Audio-Language Models: A Comprehensive Survey
EMNLP 2025
Learn and Unlearn: Addressing Misinformation in Multilingual LLMs
EMNLP 2025
SUA: Stealthy Multimodal Large Language Model Unlearning Attack
EMNLP 2025
Meta-Rewarding Language Models: Self-Improving Alignment with LLM-as-a-Meta-Judge
EMNLP 2025
EmoAgent: Assessing and Safeguarding Human-AI Interaction for Mental Health Safety
EMNLP 2025
Can an Individual Manipulate the Collective Decisions of Multi-Agents?
EMNLP 2025
Data to Defense: The Role of Curation in Aligning Large Language Models Against Safety Compromise
EMNLP 2025
Speculative Safety-Aware Decoding
EMNLP 2025
Advancing Oversight Reasoning across Languages for Audit Sycophantic Behaviour via X-Agent
EMNLP 2025
Sycophancy Mitigation Through Reinforcement Learning with Uncertainty-Aware Adaptive Reasoning Trajectories
EMNLP 2025
SimVBG: Simulating Individual Values by Backstory Generation
EMNLP 2025
The Impact of Negated Text on Hallucination with Large Language Models
EMNLP 2025
DiplomacyAgent: Do LLMs Balance Interests and Ethical Principles in International Events?
EMNLP 2025
Unmasking Deceptive Visuals: Benchmarking Multimodal Large Language Models on Misleading Chart Question Answering
EMNLP 2025
Improve LLM-as-a-Judge Ability as a General Ability
EMNLP 2025
ToolSafety: A Comprehensive Dataset for Enhancing Safety in LLM-Based Agent Tool Invocations
EMNLP 2025
SPIRIT: Patching Speech Language Models against Jailbreak Attacks
EMNLP 2025
Reward Model Perspectives: Whose Opinions Do Reward Models Reward?
EMNLP 2025
DyePack: Provably Flagging Test Set Contamination in LLMs Using Backdoors
EMNLP 2025
Jailbreak LLMs through Internal Stance Manipulation
EMNLP 2025
Pierce the Mists, Greet the Sky: Decipher Knowledge Overshadowing via Knowledge Circuit Analysis
EMNLP 2025