conftrace
AI Safety
2,972 papers
Papers per year
2002: 1
2006: 1
2007: 1
2012: 4
2013: 1
2015: 5
2016: 1
2017: 13
2018: 40
2019: 91
2020: 111
2021: 181
2022: 204
2023: 333
2024: 642
2025: 1031
2026: 312
Papers
SocialStigmaQA: A Benchmark to Uncover Stigma Amplification in Generative Language Models
AAAI 2024
Data-Driven Discovery of Design Specifications (Student Abstract)
AAAI 2024
Unsegment Anything by Simulating Deformation
CVPR 2024
Universal Vulnerabilities in Large Language Models: Backdoor Attacks for In-context Learning
EMNLP 2024
Context-aware Watermark with Semantic Balanced Green-red Lists for Large Language Models
EMNLP 2024
Glue pizza and eat rocks - Exploiting Vulnerabilities in Retrieval-Augmented Generative Models
EMNLP 2024
Diversity-Aware Annotation for Conversational AI Safety
COLING 2024
HonestLLM: Toward an Honest and Helpful Large Language Model
NIPS 2024
Uncovering Safety Risks of Large Language Models through Concept Activation Vector
NIPS 2024
TU Wien at SemEval-2024 Task 6: Unifying Model-Agnostic and Model-Aware Techniques for Hallucination Detection
SEMEVAL 2024
Injecting Undetectable Backdoors in Obfuscated Neural Networks and Language Models
NIPS 2024
Don’t be my Doctor! Recognizing Healthcare Advice in Large Language Models
EMNLP 2024
Adaptive Randomized Smoothing: Certified Adversarial Robustness for Multi-Step Defences
NIPS 2024
STOP! Benchmarking Large Language Models with Sensitivity Testing on Offensive Progressions
EMNLP 2024
Defeasible Normative Reasoning: A Proof-Theoretic Integration of Logical Argumentation
AAAI 2024
Clever Hans or Neural Theory of Mind? Stress Testing Social Reasoning in Large Language Models
EACL 2024
Random Smooth-based Certified Defense against Text Adversarial Attack
EACL 2024
Certified Adversarial Robustness via Randomized $\alpha$-Smoothing for Regression Models
NIPS 2024
WildGuard: Open One-stop Moderation Tools for Safety Risks, Jailbreaks, and Refusals of LLMs
NIPS 2024
Citation: A Key to Building Responsible and Accountable Large Language Models
NAACL 2024
“They are uncultured”: Unveiling Covert Harms and Social Threats in LLM Generated Conversations
EMNLP 2024
Breaking the False Sense of Security in Backdoor Defense through Re-Activation Attack
NIPS 2024
A Comprehensive Survey of Hallucination in Large Language, Image, Video and Audio Foundation Models
EMNLP 2024
Aligning Model Properties via Conformal Risk Control
NIPS 2024
Unveiling and Mitigating Backdoor Vulnerabilities based on Unlearning Weight Changes and Backdoor Activeness
NIPS 2024