Paul Röttger
37 papers · 2021–2026 · 10 top CS/AI conferences
Achievements
🌍 Conference Polyglot (9)
🐝 Cross-Pollinator (9)
🧭 Keyword Pioneer
🌉 Interdisciplinary Bridge
🏃 Academic Marathon (5)
🌈 Renaissance Researcher (7)
🐣 Hot Topic Early Bird
👥 Mega-Team (27)
👑 Triple Crown
🏆 Grand Slam
🤝 Dynamic Duo (13)
🔬 Deep Specialist (11)
🧬 Topic Evolution
🏆 Keyword Champion (3)
🗃️ Keyword Collector (144)
⚡ Prolific Year (12)
🔥 Unstoppable (5)
💎 Century Club (35)
❓ The Questioner
Conferences
ACL (11)
NAACL (10)
EMNLP (7)
EACL (2)
ICLR (2)
AAAI (1)
ICML (1)
IJCNLP (1)
NeurIPS (1)
SEMEVAL (1)
Research topics
Keywords
large language model (13)
hate speech detection (12)
text classification (7)
model evaluation (4)
dataset evaluation (3)
multilingual model (3)
value alignment (3)
toxic content detection (3)
multilingual nlp (3)
language model (2)
hierarchical taxonomy (2)
domain adaptation (2)
human feedback (2)
responsible ai (2)
benchmark evaluation (2)
social media analysis (2)
prompt engineering (2)
content moderation (2)
ai safety (2)
low-resource language (2)
Papers
Bias in the East, Bias in the West: A Bilingual Analysis of LLM Political Bias on U.S.- and China-Related Issues
EACL 2026
The Pluralistic Moral Gap: Understanding Moral Judgment and Value Differences between Humans and Large Language Models
EACL 2026
Personalization up to a Point: Why Personalized Content Moderation Needs Boundaries, and How We Can Enforce Them
EMNLP 2025
SafetyPrompts: A Systematic Review of Open Datasets for Evaluating and Improving Large Language Model Safety
AAAI 2025
Beyond Demographics: Fine-tuning Large Language Models to Predict Individuals’ Subjective Text Perceptions
ACL 2025
HateDay: Insights from a Global Hate Speech Dataset Representative of a Day on Twitter
ACL 2025
Around the World in 24 Hours: Probing LLM Knowledge of Time and Place
ACL 2025
Principled Personas: Defining and Measuring the Intended Effects of Persona Prompting on Task Performance
EMNLP 2025
TrojanStego: Your Language Model Can Secretly Be A Steganographic Privacy Leaking Agent
EMNLP 2025
No for Some, Yes for Others: Persona Prompts and Other Sources of False Refusal in Language Models
EMNLP 2025
Surgical, Cheap, and Flexible: Mitigating False Refusal in Language Models via Single Vector Ablation
ICLR 2025
AfriHate: A Multilingual Collection of Hate Speech and Abusive Language Datasets for African Languages
NAACL 2025
Specializing Large Language Models to Simulate Survey Response Distributions for Global Populations
NAACL 2025
XSTest: A Test Suite for Identifying Exaggerated Safety Behaviours in Large Language Models
NAACL 2024
From Languages to Geographies: Towards Evaluating Cultural Bias in Hate Speech Datasets
NAACL 2024
Improving Adversarial Data Collection by Supporting Annotators: Lessons from GAHD, a German Hate Speech Dataset
NAACL 2024
Compromesso! Italian Many-Shot Jailbreaks undermine the safety of Large Language Models
ACL 2024
Political Compass or Spinning Arrow? Towards More Meaningful Evaluations for Values and Opinions in Large Language Models
ACL 2024
The PRISM Alignment Dataset: What Participatory, Representative and Individualised Human Feedback Reveals About the Subjective and Multicultural Alignment of Large Language Models
NeurIPS 2024
Improving Covert Toxicity Detection by Retrieving and Generating References
NAACL 2024
Safety-Tuned LLaMAs: Lessons From Improving the Safety of Large Language Models that Follow Instructions
ICLR 2024
Evaluating the Elementary Multilingual Capabilities of Large Language Models with MultiQ
ACL 2024
Position: Near to Mid-term Risks and Opportunities of Open-Source Generative AI
ICML 2024
Beyond Flesch-Kincaid: Prompt-based Metrics Improve Difficulty Classification of Educational Texts
NAACL 2024
“My Answer is C”: First-Token Probabilities Do Not Match Text Answers in Instruction-Tuned Language Models
ACL 2024
SemEval-2023 Task 10: Explainable Detection of Online Sexism
ACL 2023
Improving the Detection of Multilingual Online Attacks with Rich Social Media Data from Singapore
ACL 2023
SemEval-2023 Task 10: Explainable Detection of Online Sexism
SEMEVAL 2023
The Past, Present and Better Future of Feedback Learning in Large Language Models for Subjective Human Preferences and Values
EMNLP 2023
The Ecological Fallacy in Annotation: Modeling Human Label Variation goes beyond Sociodemographics
ACL 2023
Hatemoji: A Test Suite and Adversarially-Generated Dataset for Benchmarking and Detecting Emoji-Based Hate
NAACL 2022
Data-Efficient Strategies for Expanding Hate Speech Detection into Under-Resourced Languages
EMNLP 2022
Two Contrasting Data Annotation Paradigms for Subjective NLP Tasks
NAACL 2022
Multilingual HateCheck: Functional Tests for Multilingual Hate Speech Detection Models
NAACL 2022
Temporal Adaptation of BERT and Performance on Downstream Document Classification: Insights from Social Media
EMNLP 2021
HateCheck: Functional Tests for Hate Speech Detection Models
IJCNLP 2021
HateCheck: Functional Tests for Hate Speech Detection Models
ACL 2021