content moderation

193 papers

Also known as

ACM CM

Co-occurring keywords

text classification (6776) large language model (12755) hate speech detection (716) multimodal learning (4622) toxicity detection (157) harmful content detection (51) natural language processing (2027) text generation (2903) social media analysis (699) language model (4573)

Papers

Detect-and-Guide: Self-regulation of Diffusion Models for Safe Text-to-Image Generation via Guideline Token Optimization CVPR 2025

Erasing Undesirable Influence in Diffusion Models CVPR 2025

Hate in Plain Sight: On the Risks of Moderating AI-Generated Hateful Illusions ICCV 2025

LionGuard: A Contextualized Moderation Classifier to Tackle Localized Unsafe Content COLING 2025

Conspiracy Theories and Where to Find Them on TikTok ACL 2025

Detoxify-IT: An Italian Parallel Dataset for Text Detoxification ACL 2025

Digital Gatekeepers: Google’s Role in Curating Hashtags and Subreddits ACL 2025

Are you sure? Measuring models bias in content moderation through uncertainty EMNLP 2025

NLP-ADBench: NLP Anomaly Detection Benchmark EMNLP 2025

MOSAIC: Modeling Social AI for Content Dissemination and Regulation in Multi-Agent Simulations EMNLP 2025

Guardrails and Security for LLMs: Safe, Secure and Controllable Steering of LLM Applications ACL 2025

Linking Transparency and Accountability: Analysing The Connection Between TikTok’s Terms of Service and Moderation Decisions EMNLP 2025

Words Matter: Reducing Stigma in Online Conversations about Substance Use with Large Language Models EMNLP 2024

LoRA-Guard: Parameter-Efficient Guardrail Adaptation for Content Moderation of Large Language Models EMNLP 2024

Promoting Constructive Deliberation: Reframing for Receptiveness EMNLP 2024

Please note that I’m just an AI: Analysis of Behavior Patterns of LLMs in (Non-)offensive Speech Identification EMNLP 2024

LLM generated responses to mitigate the impact of hate speech EMNLP 2024

DetoxLLM: A Framework for Detoxification with Explanations EMNLP 2024

Moderation in the Wild: Investigating User-Driven Moderation in Online Discussions EACL 2024

Recent Advances in Online Hate Speech Moderation: Multimodality and the Role of Large Models EMNLP 2024

LLMs to the Rescue: Explaining DSA Statements of Reason with Platform’s Terms of Services EMNLP 2024

GuardT2I: Defending Text-to-Image Models from Adversarial Prompts NeurIPS 2024

Rethinking Multimodal Content Moderation From an Asymmetric Angle With Mixed-Modality WACV 2024

Unified Concept Editing in Diffusion Models WACV 2024

Demonstrations Are All You Need: Advancing Offensive Content Paraphrasing using In-Context Learning ACL 2024