Research Explorer
Papers
Trends
Conferences
Explore
Authors
Topics
Keywords
Achievements
About
Methodology
AI Safety
2972 directly classified papers
Papers per year
2002: 1
2006: 1
2007: 1
2012: 4
2013: 1
2015: 5
2016: 1
2017: 13
2018: 40
2019: 91
2020: 111
2021: 181
2022: 204
2023: 333
2024: 642
2025: 1031
2026: 312
Papers
STOP! Benchmarking Large Language Models with Sensitivity Testing on Offensive Progressions
EMNLP 2024
SaySelf: Teaching LLMs to Express Confidence with Self-Reflective Rationales
EMNLP 2024
ChatGPT Doesn’t Trust Chargers Fans: Guardrail Sensitivity in Context
EMNLP 2024
Towards Understanding Jailbreak Attacks in LLMs: A Representation Space Analysis
EMNLP 2024
Advancing Adversarial Suffix Transfer Learning on Aligned Large Language Models
EMNLP 2024
Is LLM-as-a-Judge Robust? Investigating Universal Adversarial Attacks on Zero-shot LLM Assessment
EMNLP 2024
Can Large Language Models Faithfully Express Their Intrinsic Uncertainty in Words?
EMNLP 2024
Does Fine-Tuning LLMs on New Knowledge Encourage Hallucinations?
EMNLP 2024
Data Advisor: Dynamic Data Curation for Safety Alignment of Large Language Models
EMNLP 2024
Ranking Manipulation for Conversational Search Engines
EMNLP 2024
InferAligner: Inference-Time Alignment for Harmlessness through Cross-Model Guidance
EMNLP 2024
ToolBeHonest: A Multi-level Hallucination Diagnostic Benchmark for Tool-Augmented Large Language Models
EMNLP 2024
Universal Vulnerabilities in Large Language Models: Backdoor Attacks for In-context Learning
EMNLP 2024
Red Teaming Language Models for Processing Contradictory Dialogues
EMNLP 2024
Fishing for Magikarp: Automatically Detecting Under-trained Tokens in Large Language Models
EMNLP 2024
The Multilingual Alignment Prism: Aligning Global and Local Preferences to Reduce Harm
EMNLP 2024
Let Me Teach You: Pedagogical Foundations of Feedback for Language Models
EMNLP 2024
GuardBench: A Large-Scale Benchmark for Guardrail Models
EMNLP 2024
Moral Foundations of Large Language Models
EMNLP 2024
State-wise safe reinforcement learning with pixel observations
L4DC 2024
From LLMs to MLLMs: Exploring the Landscape of Multimodal Jailbreaking
EMNLP 2024
CoSafe: Evaluating Large Language Model Safety in Multi-Turn Dialogue Coreference
EMNLP 2024
RAFT: Realistic Attacks to Fool Text Detectors
EMNLP 2024
Intrinsic Self-correction for Enhanced Morality: An Analysis of Internal Mechanisms and the Superficial Hypothesis
EMNLP 2024
Distract Large Language Models for Automatic Jailbreak Attack
EMNLP 2024