Research Explorer
AI Safety
2972 directly classified papers
Papers per year
2002: 1
2006: 1
2007: 1
2012: 4
2013: 1
2015: 5
2016: 1
2017: 13
2018: 40
2019: 91
2020: 111
2021: 181
2022: 204
2023: 333
2024: 642
2025: 1031
2026: 312
Papers
IDGuard: Robust General Identity-centric POI Proactive Defense Against Face Editing Abuse
CVPR 2024
Gradient Alignment for Cross-Domain Face Anti-Spoofing
CVPR 2024
RLHF-V: Towards Trustworthy MLLMs via Behavior Alignment from Fine-grained Correctional Human Feedback
CVPR 2024
HalluciDoctor: Mitigating Hallucinatory Toxicity in Visual Instruction Data
CVPR 2024
MMA-Diffusion: MultiModal Attack on Diffusion Models
CVPR 2024
CGI-DM: Digital Copyright Authentication for Diffusion Models via Contrasting Gradient Inversion
CVPR 2024
Backdoor Defense via Test-Time Detecting and Repairing
CVPR 2024
A Framework for Approaching AI Education in Educator Preparation Programs
AAAI 2024
AI Evaluation Authorities: A Case Study Mapping Model Audits to Persistent Standards
AAAI 2024
U-trustworthy Models. Reliability, Competence, and Confidence in Decision-Making
AAAI 2024
IPRemover: A Generative Model Inversion Attack against Deep Neural Network Fingerprinting and Watermarking
AAAI 2024
Aligning Model Properties via Conformal Risk Control
NIPS 2024
On the Scalability of Certified Adversarial Robustness with Generated Data
NIPS 2024
SEEV: Synthesis with Efficient Exact Verification for ReLU Neural Barrier Functions
NIPS 2024
Fine-Tuning Personalization in Federated Learning to Mitigate Adversarial Clients
NIPS 2024
SuperDeepFool: a new fast and accurate minimal adversarial attack
NIPS 2024
Improved Generation of Adversarial Examples Against Safety-aligned LLMs
NIPS 2024
Can an AI Agent Safely Run a Government? Existence of Probably Approximately Aligned Policies
NIPS 2024
Decision-Making Behavior Evaluation Framework for LLMs under Uncertain Context
NIPS 2024
Critically Assessing the State of the Art in Neural Network Verification
JMLR 2024
Inspecting Prediction Confidence for Detecting Black-Box Backdoor Attacks
AAAI 2024
Model Merging and Safety Alignment: One Bad Model Spoils the Bunch
EMNLP 2024
Mechanistic Understanding and Mitigation of Language Model Non-Factual Hallucinations
EMNLP 2024
A Test Suite of Prompt Injection Attacks for LLM-based Machine Translation
EMNLP 2024
Eliciting Uncertainty in Chain-of-Thought to Mitigate Bias against Forecasting Harmful User Behaviors
EMNLP 2024