Dirk Groeneveld

13 papers · 2018–2025 · 7 conferences · across top CS/AI conferences

Achievements

🏃 Academic Marathon (7) 🌉 Interdisciplinary Bridge 🧭 Keyword Pioneer 🌍 Conference Polyglot (7) 🐝 Cross-Pollinator (12)

🌈 Renaissance Researcher (6) 👥 Mega-Team (60) 🧬 Topic Evolution ⚡ Prolific Year (5) 💎 Century Club (13) 🗃️ Keyword Collector (59) ❓ The Questioner

Conferences

ACL (3) EMNLP (3) ICLR (2) NIPS (2) CVPR (1) ICML (1) NAACL (1)

Top co-authors

Luca Soldaini (9) Kyle Lo (7) Hannaneh Hajishirzi (7) Akshita Bhagia (7) Jesse Dodge (7) Noah A. Smith (6) Dustin Schwenk (5) Iz Beltagy (5) Yanai Elazar (5) Niklas Muennighoff (5)

Keywords

large language model (3) language model (3) language model pretraining (2) data curation (2) data filtering (2) web corpus (2) training data (2) entity linking (1) question decomposition (1) transfer learning (1) context understanding (1) corpus construction (1) image captioning (1) continued pretraining (1) multimodal learning (1) visual question answering (1) model training (1) text retrieval (1) language modeling (1) dataset collection (1)

Papers

OLMoE: Open Mixture-of-Experts Language Models · ICLR 2025
OLMoTrace: Tracing Language Model Outputs Back to Trillions of Training Tokens · ACL 2025
Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Vision-Language Models · CVPR 2025
DataDecide: How to Predict Best Pretraining Data with Small Experiments · ICML 2025
OLMo: Accelerating the Science of Language Models · ACL 2024
What's In My Big Data? · ICLR 2024
DataComp-LM: In search of the next generation of training sets for language models · NIPS 2024
Paloma: A Benchmark for Evaluating Language Model Fit · NIPS 2024
Dolma: an Open Corpus of Three Trillion Tokens for Language Model Pretraining Research · ACL 2024
Continued Pretraining for Better Zero- and Few-Shot Promptability · EMNLP 2022
Documenting Large Webtext Corpora: A Case Study on the Colossal Clean Crawled Corpus · EMNLP 2021
A Simple Yet Strong Pipeline for HotpotQA · EMNLP 2020
Construction of the Literature Graph in Semantic Scholar · NAACL 2018