Research Explorer
AI Safety
2972 directly classified papers
Papers per year
2002: 1
2006: 1
2007: 1
2012: 4
2013: 1
2015: 5
2016: 1
2017: 13
2018: 40
2019: 91
2020: 111
2021: 181
2022: 204
2023: 333
2024: 642
2025: 1031
2026: 312
Papers
HausaNLP at SemEval-2025 Task 3: Towards a Fine-Grained Model-Aware Hallucination Detection
SEMEVAL 2025
STAR: Self-Automated Back-Querying for Production Data Generation
IJCNLP 2025
Mirror Minds: An Empirical Study on Detecting LLM-Generated Text via LLMs
COLING 2025
Human vs. AI: A Novel Benchmark and a Comparative Study on the Detection of Generated Images and the Impact of Prompts
COLING 2025
Contrastive Perplexity for Controlled Generation: An Application in Detoxifying Large Language Models
ACL 2025
SilverSpeak: Evading AI-Generated Text Detectors using Homoglyphs
COLING 2025
Chat Bankman-Fried: an Exploration of LLM Alignment in Finance
COLING 2025
When Backdoors Speak: Understanding LLM Backdoor Attacks Through Model-Generated Explanations
ACL 2025
MALTO at SemEval-2025 Task 3: Detecting Hallucinations in LLMs via Uncertainty Quantification and Larger Model Validation
SEMEVAL 2025
Towards Truly Open, Language-Specific, Safe, Factual, and Specialized Large Language Models
COLING 2025
Lightweight Safety Guardrails Using Fine-tuned BERT Embeddings
COLING 2025
Bias in the Mirror: Are LLMs opinions robust to their own adversarial attacks
ACL 2025
Aurora-M: Open Source Continual Pre-training for Multilingual Language and Code
COLING 2025
What Really Matters in Many-Shot Attacks? An Empirical Study of Long-Context Vulnerabilities in LLMs
ACL 2025
Tuning-Free Accountable Intervention for LLM Deployment – a Metacognitive Approach
AAAI 2025
AILS-NTUA at SemEval-2025 Task 3: Leveraging Large Language Models and Translation Strategies for Multilingual Hallucination Detection
SEMEVAL 2025
AILS-NTUA at SemEval-2025 Task 4: Parameter-Efficient Unlearning for Large Language Models using Data Chunking
SEMEVAL 2025
Evolution of Aegis: Fault Diagnosis for AI Model Training Service in Production
NSDI 2025
Root Defense Strategies: Ensuring Safety of LLM at the Decoding Level
ACL 2025
Can We Detect Failures Without Failure Data? Uncertainty-Aware Runtime Failure Detection for Imitation Learning Policies
RSS 2025
Defending Large Language Models against Jailbreak Attacks via Semantic Smoothing
AACL 2025
Binary Classifier Optimization for Large Language Model Alignment
ACL 2025
Judging the Judges: A Systematic Study of Position Bias in LLM-as-a-Judge
AACL 2025
BlueToad at SemEval-2025 Task 3: Using Question-Answering-Based Language Models to Extract Hallucinations from Machine-Generated Text
SEMEVAL 2025
Algorithmic Fidelity of Large Language Models in Generating Synthetic German Public Opinions: A Case Study
ACL 2025