Research Explorer
AI Safety
2972 directly classified papers
Papers per year
2002: 1
2006: 1
2007: 1
2012: 4
2013: 1
2015: 5
2016: 1
2017: 13
2018: 40
2019: 91
2020: 111
2021: 181
2022: 204
2023: 333
2024: 642
2025: 1031
2026: 312
Papers
Judging the Judges: Evaluating Alignment and Vulnerabilities in LLMs-as-Judges
ACL 2025
GuideBench: Benchmarking Domain-Oriented Guideline Following for LLM Agents
ACL 2025
Cleanse: Uncertainty Estimation Approach Using Clustering-based Semantic Consistency in LLMs
ACL 2025
Are Bias Evaluation Methods Biased ?
ACL 2025
Improving Factuality with Explicit Working Memory
ACL 2025
Language Models Resist Alignment: Evidence From Data Compression
ACL 2025
SEA: Low-Resource Safety Alignment for Multimodal Large Language Models via Synthetic Embeddings
ACL 2025
Removing Prompt-template Bias in Reinforcement Learning from Human Feedback
ACL 2025
The Law of Knowledge Overshadowing: Towards Understanding, Predicting, and Preventing LLM Hallucination
ACL 2025
PL-Guard: Benchmarking Language Model Safety for Polish
ACL 2025
Beware of Your Po! Measuring and Mitigating AI Safety Risks in Role-Play Fine-Tuning of LLMs
ACL 2025
Enhancing Security and Strengthening Defenses in Automated Short-Answer Grading Systems
ACL 2025
Blinded by Context: Unveiling the Halo Effect of MLLM in AI Hiring
ACL 2025
PrivacyRestore: Privacy-Preserving Inference in Large Language Models via Privacy Removal and Restoration
ACL 2025
ELBA-Bench: An Efficient Learning Backdoor Attacks Benchmark for Large Language Models
ACL 2025
Human Bias in the Face of AI: Examining Human Judgment Against Text Labeled as AI Generated
ACL 2025
Meta-Unlearning on Diffusion Models: Preventing Relearning Unlearned Concepts
ICCV 2025
InvestAlign: Overcoming Data Scarcity in Aligning Large Language Models with Investor Decision-Making Processes Under Herd Behavior
ACL 2025
Prototype Guided Backdoor Defense via Activation Space Manipulation
ICCV 2025
DCT-Shield: A Robust Frequency Domain Defense against Malicious Image Editing
ICCV 2025
PIG: Privacy Jailbreak Attack on LLMs via Gradient-based Iterative In-Context Optimization
ACL 2025
FLUE: Streamlined Uncertainty Estimation for Large Language Models
AAAI 2025
Improved Unbiased Watermark for Large Language Models
ACL 2025
PLA: Prompt Learning Attack against Text-to-Image Generative Models
ICCV 2025
Gradient Short-Circuit: Efficient Out-of-Distribution Detection via Feature Intervention
ICCV 2025