Luca Soldaini

40 papers · 2015–2026 · 13 top CS/AI conferences

Achievements

🌍 Conference Polyglot (13) 🏃 Academic Marathon (10) 🌉 Interdisciplinary Bridge 🧭 Keyword Pioneer 🐝 Cross-Pollinator (12)

🌈 Renaissance Researcher (6) 🐣 Hot Topic Early Bird 🤝 Dynamic Duo (17) 🏆 Grand Slam 👥 Mega-Team (60) 🔬 Deep Specialist (10) 🧬 Topic Evolution 🗃️ Keyword Collector (143) ❓ The Questioner (3) ⚡ Prolific Year (10) 🔥 Unstoppable (6) 💎 Century Club (39)

Conferences

ACL (8) EMNLP (8) NAACL (6) EACL (3) ICLR (3) COLING (2) ICML (2) IJCNLP (2) NIPS (2) AAAI (1) AACL (1) CVPR (1) SEMEVAL (1)

Top co-authors

Kyle Lo (18) Arman Cohan (13) Alessandro Moschitti (10) Dirk Groeneveld (9) Hannaneh Hajishirzi (9) Jesse Dodge (7) Noah A. Smith (6) Akshita Bhagia (6) Niklas Muennighoff (6) Oyvind Tafjord (5)

Keywords

question answering (8) large language model (6) answer sentence selection (5) information retrieval (4) text classification (4) language model (4) data curation (3) document retrieval (3) instruction following (3) multimodal learning (3) domain adaptation (2) knowledge distillation (2) transfer learning (2) information extraction (2) training data (2) data filtering (2) machine reading comprehension (2) mathematical reasoning (2) text generation (2) model compression (2)

Papers

The olmOCR Project: Building Fully Open OCR using VLMs ACL 2026
DataDecide: How to Predict Best Pretraining Data with Small Experiments ICML 2025
Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Vision-Language Models CVPR 2025
OLMoE: Open Mixture-of-Experts Language Models ICLR 2025
Language models scale reliably with over-training and on downstream tasks ICLR 2025
RouterRetriever: Routing over a Mixture of Expert Embedding Models AAAI 2025
SciRIFF: A Resource to Enhance Language Model Instruction-Following over Scientific Literature EMNLP 2025
FollowIR: Evaluating and Teaching Information Retrieval Models to Follow Instructions NAACL 2025
DrawEduMath: Evaluating Vision Language Models with Expert-Annotated Students’ Hand-Drawn Math Images NAACL 2025
Organize the Web: Constructing Domains Enhances Pre-Training Data Curation ICML 2025
OLMoTrace: Tracing Language Model Outputs Back to Trillions of Training Tokens ACL 2025
DataComp-LM: In search of the next generation of training sets for language models NIPS 2024
Paloma: A Benchmark for Evaluating Language Model Fit NIPS 2024
AboutMe: Using Self-Descriptions in Webpages to Document the Effects of English Pretraining Data Filters ACL 2024
Dolma: an Open Corpus of Three Trillion Tokens for Language Model Pretraining Research ACL 2024
OLMo: Accelerating the Science of Language Models ACL 2024
KIWI: A Dataset of Knowledge-Intensive Writing Instructions for Answering Research Questions ACL 2024
When do Generative Query and Document Expansions Fail? A Comprehensive Study Across Methods, Retrievers, and Datasets EACL 2024
MathFish: Evaluating Language Model Math Reasoning via Grounding in Educational Curricula EMNLP 2024
What's In My Big Data? ICLR 2024
PaperMage: A Unified Toolkit for Processing, Representing, and Manipulating Visually-Rich Scientific Documents EMNLP 2023
Open Domain Multi-document Summarization: A Comprehensive Study of Model Brittleness under Retrieval EMNLP 2023
Embedding Recycling for Language Models EACL 2023
A Question Answering Framework for Decontextualizing User-facing Snippets from Scientific Documents EMNLP 2023
Ensemble Transformer for Efficient and Accurate Ranking Tasks: an Application to Question Answering Systems EMNLP 2022
Knowledge Transfer from Answer Ranking to Answer Generation EMNLP 2022
Cross-Lingual Open-Domain Question Answering with Answer Sentence Generation AACL 2022
Cross-Lingual Open-Domain Question Answering with Answer Sentence Generation IJCNLP 2022
Paragraph-based Transformer Pre-training for Multi-Sentence Inference NAACL 2022
Pre-training Transformer Models with Sentence-Level Objectives for Answer Sentence Selection EMNLP 2022
Answer Generation for Retrieval-based Question Answering Systems ACL 2021
Modeling Context in Answer Sentence Selection Systems on a Latency Budget EACL 2021
Answer Generation for Retrieval-based Question Answering Systems IJCNLP 2021
The Cascade Transformer: an Application for Efficient Answer Sentence Selection ACL 2020
Multi-task Learning of Spoken Language Understanding by Integrating N-Best Hypotheses with Hierarchical Attention COLING 2020
SMHD: a Large-Scale Resource for Exploring Online Language Usage for Multiple Mental Health Conditions COLING 2018
RSDD-Time: Temporal Annotation of Self-Reported Mental Health Diagnoses NAACL 2018
Helping or Hurting? Predicting Changes in Users’ Risk of Self-Harm Through Online Community Interactions NAACL 2018
GU IRLAB at SemEval-2018 Task 7: Tree-LSTMs for Scientific Relation Classification SEMEVAL 2018
Matching Citation Text and Cited Spans in Biomedical Literature: a Search-Oriented Approach NAACL 2015