Research Explorer
AI Safety
2972 directly classified papers
Papers per year
2002: 1
2006: 1
2007: 1
2012: 4
2013: 1
2015: 5
2016: 1
2017: 13
2018: 40
2019: 91
2020: 111
2021: 181
2022: 204
2023: 333
2024: 642
2025: 1031
2026: 312
Papers
Shadowcast: Stealthy Data Poisoning Attacks Against Vision-Language Models
NIPS 2024
CLAVE: An Adaptive Framework for Evaluating Values of LLM Generated Responses
NIPS 2024
Jailbreaking Large Language Models Against Moderation Guardrails via Cipher Characters
NIPS 2024
Countering Personalized Text-to-Image Generation with Influence Watermarks
CVPR 2024
Reasons to Reject? Aligning Language Models with Judgments
ACL 2024
Chaos with Keywords: Exposing Large Language Models Sycophancy to Misleading Keywords and Evaluating Defense Strategies
ACL 2024
KIWI: A Dataset of Knowledge-Intensive Writing Instructions for Answering Research Questions
ACL 2024
EditGuard: Versatile Image Watermarking for Tamper Localization and Copyright Protection
CVPR 2024
ANAH-v2: Scaling Analytical Hallucination Annotation of Large Language Models
NIPS 2024
Unsegment Anything by Simulating Deformation
CVPR 2024
The Art of Defending: A Systematic Evaluation and Analysis of LLM Defense Strategies on Safety and Over-Defensiveness
ACL 2024
ConStat: Performance-Based Contamination Detection in Large Language Models
NIPS 2024
Mitigating Biases for Instruction-following Language Models via Bias Neurons Elimination
ACL 2024
Tight Verification of Probabilistic Robustness in Bayesian Neural Networks
AISTATS 2024
Play Guessing Game with LLM: Indirect Jailbreak Attack with Implicit Clues
ACL 2024
Sowing the Wind, Reaping the Whirlwind: The Impact of Editing Language Models
ACL 2024
Subtle Biases Need Subtler Measures: Dual Metrics for Evaluating Representative and Affinity Bias in Large Language Models
ACL 2024
Unlearning Traces the Influential Training Data of Language Models
ACL 2024
D-NLP at SemEval-2024 Task 2: Evaluating Clinical Inference Capabilities of Large Language Models
SEMEVAL 2024
Groningen Group E at SemEval-2024 Task 8: Detecting machine-generated texts through pre-trained language models augmented with explicit linguistic-stylistic features
SEMEVAL 2024
TU Wien at SemEval-2024 Task 6: Unifying Model-Agnostic and Model-Aware Techniques for Hallucination Detection
SEMEVAL 2024
Compos Mentis at SemEval2024 Task6: A Multi-Faceted Role-based Large Language Model Ensemble to Detect Hallucination
SEMEVAL 2024
From Shortcuts to Triggers: Backdoor Defense with Denoised PoE
NAACL 2024
Two Heads are Better than One: Nested PoE for Robust Defense Against Multi-Backdoors
NAACL 2024
A Closer Look at the Self-Verification Abilities of Large Language Models in Logical Reasoning
NAACL 2024