Dirk Groeneveld
13 papers · 2018–2025 · 7 top CS/AI conferences
Achievements
🏃 Academic Marathon (7)
🌉 Interdisciplinary Bridge
🧭 Keyword Pioneer
🌍 Conference Polyglot (7)
🐝 Cross-Pollinator (12)
🌈 Renaissance Researcher (6)
👥 Mega-Team (60)
🧬 Topic Evolution
⚡ Prolific Year (5)
💎 Century Club (13)
🗃️ Keyword Collector (59)
❓ The Questioner
Conferences
ACL (3)
EMNLP (3)
ICLR (2)
NeurIPS (2)
CVPR (1)
ICML (1)
NAACL (1)
Top co-authors
Keywords
large language model (3)
language model (3)
language model pretraining (2)
data curation (2)
data filtering (2)
web corpus (2)
training datum (2)
entity linking (1)
question decomposition (1)
transfer learning (1)
context understanding (1)
corpus construction (1)
image captioning (1)
continued pretraining (1)
multimodal learning (1)
visual question answering (1)
model training (1)
text retrieval (1)
language modeling (1)
dataset collection (1)
Papers
OLMoE: Open Mixture-of-Experts Language Models
ICLR 2025
OLMoTrace: Tracing Language Model Outputs Back to Trillions of Training Tokens
ACL 2025
Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Vision-Language Models
CVPR 2025
DataDecide: How to Predict Best Pretraining Data with Small Experiments
ICML 2025
OLMo: Accelerating the Science of Language Models
ACL 2024
What's In My Big Data?
ICLR 2024
DataComp-LM: In search of the next generation of training sets for language models
NeurIPS 2024
Paloma: A Benchmark for Evaluating Language Model Fit
NeurIPS 2024
Dolma: an Open Corpus of Three Trillion Tokens for Language Model Pretraining Research
ACL 2024
Continued Pretraining for Better Zero- and Few-Shot Promptability
EMNLP 2022
Documenting Large Webtext Corpora: A Case Study on the Colossal Clean Crawled Corpus
EMNLP 2021
A Simple Yet Strong Pipeline for HotpotQA
EMNLP 2020
Construction of the Literature Graph in Semantic Scholar
NAACL 2018