Josef Dai
8 papers · 2023–2025 · 4 top CS/AI conferences
Achievements
Hot Topic Early Bird
Interdisciplinary Bridge
Keyword Pioneer
Conference Polyglot (4)
Cross-Pollinator (7)
Renaissance Researcher (5)
Taxonomy Completionist (20)
Prolific Year (5)
Conferences
ACL (4)
NeurIPS (2)
AAAI (1)
ICLR (1)
Keywords
reinforcement learning from human feedback (4)
large language model (3)
reward modeling (2)
safety alignment (2)
human preference (2)
language model alignment (2)
responsible ai (1)
model alignment (1)
data compression (1)
safe reinforcement learning (1)
bayesian network (1)
sequence-to-sequence learning (1)
safety evaluation (1)
safety benchmark (1)
preference datum (1)
alignment fine-tuning (1)
model elasticity (1)
pre-training distribution (1)
harmful output mitigation (1)
safe policy optimization (1)
Papers
SafeLawBench: Towards Safe Alignment of Large Language Models
ACL 2025
Reward Generalization in RLHF: A Topological Perspective
ACL 2025
Sequence to Sequence Reward Modeling: Improving RLHF by Language Feedback
AAAI 2025
Language Models Resist Alignment: Evidence From Data Compression
ACL 2025
PKU-SafeRLHF: Towards Multi-Level Safety Alignment for LLMs with Human Preference
ACL 2025
Safe RLHF: Safe Reinforcement Learning from Human Feedback
ICLR 2024
BeaverTails: Towards Improved Safety Alignment of LLM via a Human-Preference Dataset
NeurIPS 2023
Safety Gymnasium: A Unified Safe Reinforcement Learning Benchmark
NeurIPS 2023