Research Explorer
AI Safety
2972 directly classified papers
Papers per year
2002: 1
2006: 1
2007: 1
2012: 4
2013: 1
2015: 5
2016: 1
2017: 13
2018: 40
2019: 91
2020: 111
2021: 181
2022: 204
2023: 333
2024: 642
2025: 1031
2026: 312
Papers
Advancing Oversight Reasoning across Languages for Audit Sycophantic Behaviour via X-Agent
EMNLP 2025
Adversary-Aware DPO: Enhancing Safety Alignment in Vision Language Models via Adversarial Training
EMNLP 2025
Speculative Safety-Aware Decoding
EMNLP 2025
Can You Trick the Grader? Adversarial Persuasion of LLM Judges
EMNLP 2025
The Confidence Paradox: Can LLM Know When It’s Wrong?
IJCNLP 2025
Neutral Is Not Unbiased: Evaluating Implicit and Intersectional Identity Bias in LLMs Through Structured Narrative Scenarios
EMNLP 2025
Towards Better Value Principles for Large Language Model Alignment: A Systematic Evaluation and Enhancement
ACL 2025
Data to Defense: The Role of Curation in Aligning Large Language Models Against Safety Compromise
EMNLP 2025
Can an Individual Manipulate the Collective Decisions of Multi-Agents?
EMNLP 2025
On Guardrail Models’ Robustness to Mutations and Adversarial Attacks
EMNLP 2025
MTSA: Multi-turn Safety Alignment for LLMs through Multi-round Red-teaming
ACL 2025
Reasoning-Augmented Conversation for Multi-Turn Jailbreak Attacks on Large Language Models
EMNLP 2025
EmoAgent: Assessing and Safeguarding Human-AI Interaction for Mental Health Safety
EMNLP 2025
Beneath the Facade: Probing Safety Vulnerabilities in LLMs via Auto-Generated Jailbreak Prompts
EMNLP 2025
Meta-Rewarding Language Models: Self-Improving Alignment with LLM-as-a-Meta-Judge
EMNLP 2025
BHRAM-IL: A Benchmark for Hallucination Recognition and Assessment in Multiple Indian Languages
IJCNLP 2025
Universal Acoustic Adversarial Attacks for Flexible Control of Speech-LLMs
EMNLP 2025
Scalable Surrogate Verification of Image-Based Neural Network Control Systems Using Composition and Unrolling
AAAI 2025
SudoLM: Learning Access Control of Parametric Knowledge with Authorization Alignment
ACL 2025
Assessing Reliability and Political Bias In LLMs’ Judgements of Formal and Material Inferences With Partisan Conclusions
ACL 2025
PricingLogic: Evaluating LLMs Reasoning on Complex Tourism Pricing Tasks
EMNLP 2025
FinTrust: A Comprehensive Benchmark of Trustworthiness Evaluation in Finance Domain
EMNLP 2025
SUA: Stealthy Multimodal Large Language Model Unlearning Attack
EMNLP 2025
Learn and Unlearn: Addressing Misinformation in Multilingual LLMs
EMNLP 2025
Challenges and Remedies of Domain-Specific Classifiers as LLM Guardrails: Self-Harm as a Case Study
NAACL 2025