conftrace_

Paul Röttger

37 papers · 2021–2026 · 10 top CS/AI conferences

Achievements

🌍 Conference Polyglot (9) 🐝 Cross-Pollinator (9) 🧭 Keyword Pioneer 🌉 Interdisciplinary Bridge 🏃 Academic Marathon (5)

🧭 Keyword Pioneer 🌈 Renaissance Researcher (7) 🐣 Hot Topic Early Bird 👥 Mega-Team (27) 👑 Triple Crown 🏆 Grand Slam 🤝 Dynamic Duo (13) 🔬 Deep Specialist (11) 🧬 Topic Evolution 🏆 Keyword Champion (3) 🗃️ Keyword Collector (144) ⚡ Prolific Year (12) 🔥 Unstoppable (5) 💎 Century Club (35) ❓ The Questioner

Conferences

ACL (11) NAACL (10) EMNLP (7) EACL (2) ICLR (2) AAAI (1) ICML (1) IJCNLP (1) NIPS (1) SEMEVAL (1)

Top co-authors

Dirk Hovy (13) Bertie Vidgen (13) Hannah Kirk (5) Scott A. Hale (5) Janet Pierrehumbert (4) Debora Nozza (4) Federico Bianchi (3) Matthias Orlikowski (3) Carolin Holtermann (2) Samuel Fraiberger (2)

Research topics

Resources & Methods (1) Privacy (1)

Keywords

large language model (13) hate speech detection (12) text classification (7) model evaluation (4) dataset evaluation (3) multilingual model (3) value alignment (3) toxic content detection (3) multilingual nlp (3) language model (2) hierarchical taxonomy (2) domain adaptation (2) human feedback (2) responsible ai (2) benchmark evaluation (2) social media analysis (2) prompt engineering (2) content moderation (2) ai safety (2) low-resource language (2)

Papers

Bias in the East, Bias in the West: A Bilingual Analysis of LLM Political Bias on U.S.- and China-Related Issues EACL 2026
The Pluralistic Moral Gap: Understanding Moral Judgment and Value Differences between Humans and Large Language Models EACL 2026
Personalization up to a Point: Why Personalized Content Moderation Needs Boundaries, and How We Can Enforce Them EMNLP 2025
SafetyPrompts: A Systematic Review of Open Datasets for Evaluating and Improving Large Language Model Safety AAAI 2025
Beyond Demographics: Fine-tuning Large Language Models to Predict Individuals’ Subjective Text Perceptions ACL 2025
HateDay: Insights from a Global Hate Speech Dataset Representative of a Day on Twitter ACL 2025
Around the World in 24 Hours: Probing LLM Knowledge of Time and Place ACL 2025
Principled Personas: Defining and Measuring the Intended Effects of Persona Prompting on Task Performance EMNLP 2025
TrojanStego: Your Language Model Can Secretly Be A Steganographic Privacy Leaking Agent EMNLP 2025
No for Some, Yes for Others: Persona Prompts and Other Sources of False Refusal in Language Models EMNLP 2025
Surgical, Cheap, and Flexible: Mitigating False Refusal in Language Models via Single Vector Ablation ICLR 2025
AfriHate: A Multilingual Collection of Hate Speech and Abusive Language Datasets for African Languages NAACL 2025
Specializing Large Language Models to Simulate Survey Response Distributions for Global Populations NAACL 2025
XSTest: A Test Suite for Identifying Exaggerated Safety Behaviours in Large Language Models NAACL 2024
From Languages to Geographies: Towards Evaluating Cultural Bias in Hate Speech Datasets NAACL 2024
Improving Adversarial Data Collection by Supporting Annotators: Lessons from GAHD, a German Hate Speech Dataset NAACL 2024
Compromesso! Italian Many-Shot Jailbreaks undermine the safety of Large Language Models ACL 2024
Political Compass or Spinning Arrow? Towards More Meaningful Evaluations for Values and Opinions in Large Language Models ACL 2024
The PRISM Alignment Dataset: What Participatory, Representative and Individualised Human Feedback Reveals About the Subjective and Multicultural Alignment of Large Language Models NIPS 2024
Improving Covert Toxicity Detection by Retrieving and Generating References NAACL 2024
Safety-Tuned LLaMAs: Lessons From Improving the Safety of Large Language Models that Follow Instructions ICLR 2024
Evaluating the Elementary Multilingual Capabilities of Large Language Models with MultiQ ACL 2024
Position: Near to Mid-term Risks and Opportunities of Open-Source Generative AI ICML 2024
Beyond Flesch-Kincaid: Prompt-based Metrics Improve Difficulty Classification of Educational Texts NAACL 2024
“My Answer is C”: First-Token Probabilities Do Not Match Text Answers in Instruction-Tuned Language Models ACL 2024
SemEval-2023 Task 10: Explainable Detection of Online Sexism ACL 2023
Improving the Detection of Multilingual Online Attacks with Rich Social Media Data from Singapore ACL 2023
SemEval-2023 Task 10: Explainable Detection of Online Sexism SEMEVAL 2023
The Past, Present and Better Future of Feedback Learning in Large Language Models for Subjective Human Preferences and Values EMNLP 2023
The Ecological Fallacy in Annotation: Modeling Human Label Variation goes beyond Sociodemographics ACL 2023
Hatemoji: A Test Suite and Adversarially-Generated Dataset for Benchmarking and Detecting Emoji-Based Hate NAACL 2022
Data-Efficient Strategies for Expanding Hate Speech Detection into Under-Resourced Languages EMNLP 2022
Two Contrasting Data Annotation Paradigms for Subjective NLP Tasks NAACL 2022
Multilingual HateCheck: Functional Tests for Multilingual Hate Speech Detection Models NAACL 2022
Temporal Adaptation of BERT and Performance on Downstream Document Classification: Insights from Social Media EMNLP 2021
HateCheck: Functional Tests for Hate Speech Detection Models IJCNLP 2021
HateCheck: Functional Tests for Hate Speech Detection Models ACL 2021