conftrace_

AI Safety

2,972 papers

Papers

SocialStigmaQA: A Benchmark to Uncover Stigma Amplification in Generative Language Models AAAI 2024

Data-Driven Discovery of Design Specifications (Student Abstract) AAAI 2024

Unsegment Anything by Simulating Deformation CVPR 2024

Universal Vulnerabilities in Large Language Models: Backdoor Attacks for In-context Learning EMNLP 2024

Context-aware Watermark with Semantic Balanced Green-red Lists for Large Language Models EMNLP 2024

Glue pizza and eat rocks - Exploiting Vulnerabilities in Retrieval-Augmented Generative Models EMNLP 2024

Diversity-Aware Annotation for Conversational AI Safety COLING 2024

HonestLLM: Toward an Honest and Helpful Large Language Model NIPS 2024

Uncovering Safety Risks of Large Language Models through Concept Activation Vector NIPS 2024

TU Wien at SemEval-2024 Task 6: Unifying Model-Agnostic and Model-Aware Techniques for Hallucination Detection SEMEVAL 2024

Injecting Undetectable Backdoors in Obfuscated Neural Networks and Language Models NIPS 2024

Don’t be my Doctor! Recognizing Healthcare Advice in Large Language Models EMNLP 2024

Adaptive Randomized Smoothing: Certified Adversarial Robustness for Multi-Step Defences NIPS 2024

STOP! Benchmarking Large Language Models with Sensitivity Testing on Offensive Progressions EMNLP 2024

Defeasible Normative Reasoning: A Proof-Theoretic Integration of Logical Argumentation AAAI 2024

Clever Hans or Neural Theory of Mind? Stress Testing Social Reasoning in Large Language Models EACL 2024

Random Smooth-based Certified Defense against Text Adversarial Attack EACL 2024

Certified Adversarial Robustness via Randomized $\alpha$-Smoothing for Regression Models NIPS 2024

WildGuard: Open One-stop Moderation Tools for Safety Risks, Jailbreaks, and Refusals of LLMs NIPS 2024

Citation: A Key to Building Responsible and Accountable Large Language Models NAACL 2024

“They are uncultured”: Unveiling Covert Harms and Social Threats in LLM Generated Conversations EMNLP 2024

Breaking the False Sense of Security in Backdoor Defense through Re-Activation Attack NIPS 2024

A Comprehensive Survey of Hallucination in Large Language, Image, Video and Audio Foundation Models EMNLP 2024

Aligning Model Properties via Conformal Risk Control NIPS 2024

Unveiling and Mitigating Backdoor Vulnerabilities based on Unlearning Weight Changes and Backdoor Activeness NIPS 2024