Luca Soldaini
40 papers · 2015–2026 · 13 top CS/AI conferences
Achievements
Conference Polyglot (13)
Academic Marathon (10)
Interdisciplinary Bridge
Keyword Pioneer
Cross-Pollinator (12)
Renaissance Researcher (6)
Hot Topic Early Bird
Dynamic Duo (17)
Grand Slam
Mega-Team (60)
Deep Specialist (10)
Topic Evolution
Keyword Collector (143)
The Questioner (3)
Prolific Year (10)
Unstoppable (6)
Century Club (39)
Conferences
ACL (8)
EMNLP (8)
NAACL (6)
EACL (3)
ICLR (3)
COLING (2)
ICML (2)
IJCNLP (2)
NIPS (2)
AAAI (1)
AACL (1)
CVPR (1)
SEMEVAL (1)
Keywords
question answering (8)
large language model (6)
answer sentence selection (5)
information retrieval (4)
text classification (4)
language model (4)
data curation (3)
document retrieval (3)
instruction following (3)
multimodal learning (3)
domain adaptation (2)
knowledge distillation (2)
transfer learning (2)
information extraction (2)
training datum (2)
data filtering (2)
machine reading comprehension (2)
mathematical reasoning (2)
text generation (2)
model compression (2)
Papers
The olmOCR Project: Building Fully Open OCR using VLMs
ACL 2026
DataDecide: How to Predict Best Pretraining Data with Small Experiments
ICML 2025
Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Vision-Language Models
CVPR 2025
OLMoE: Open Mixture-of-Experts Language Models
ICLR 2025
Language models scale reliably with over-training and on downstream tasks
ICLR 2025
RouterRetriever: Routing over a Mixture of Expert Embedding Models
AAAI 2025
SciRIFF: A Resource to Enhance Language Model Instruction-Following over Scientific Literature
EMNLP 2025
FollowIR: Evaluating and Teaching Information Retrieval Models to Follow Instructions
NAACL 2025
DrawEduMath: Evaluating Vision Language Models with Expert-Annotated Students' Hand-Drawn Math Images
NAACL 2025
Organize the Web: Constructing Domains Enhances Pre-Training Data Curation
ICML 2025
OLMoTrace: Tracing Language Model Outputs Back to Trillions of Training Tokens
ACL 2025
DataComp-LM: In search of the next generation of training sets for language models
NIPS 2024
Paloma: A Benchmark for Evaluating Language Model Fit
NIPS 2024
AboutMe: Using Self-Descriptions in Webpages to Document the Effects of English Pretraining Data Filters
ACL 2024
Dolma: an Open Corpus of Three Trillion Tokens for Language Model Pretraining Research
ACL 2024
OLMo: Accelerating the Science of Language Models
ACL 2024
KIWI: A Dataset of Knowledge-Intensive Writing Instructions for Answering Research Questions
ACL 2024
When do Generative Query and Document Expansions Fail? A Comprehensive Study Across Methods, Retrievers, and Datasets
EACL 2024
MathFish: Evaluating Language Model Math Reasoning via Grounding in Educational Curricula
EMNLP 2024
What's In My Big Data?
ICLR 2024
PaperMage: A Unified Toolkit for Processing, Representing, and Manipulating Visually-Rich Scientific Documents
EMNLP 2023
Open Domain Multi-document Summarization: A Comprehensive Study of Model Brittleness under Retrieval
EMNLP 2023
Embedding Recycling for Language Models
EACL 2023
A Question Answering Framework for Decontextualizing User-facing Snippets from Scientific Documents
EMNLP 2023
Ensemble Transformer for Efficient and Accurate Ranking Tasks: an Application to Question Answering Systems
EMNLP 2022
Knowledge Transfer from Answer Ranking to Answer Generation
EMNLP 2022
Cross-Lingual Open-Domain Question Answering with Answer Sentence Generation
AACL 2022
Cross-Lingual Open-Domain Question Answering with Answer Sentence Generation
IJCNLP 2022
Paragraph-based Transformer Pre-training for Multi-Sentence Inference
NAACL 2022
Pre-training Transformer Models with Sentence-Level Objectives for Answer Sentence Selection
EMNLP 2022
Answer Generation for Retrieval-based Question Answering Systems
ACL 2021
Modeling Context in Answer Sentence Selection Systems on a Latency Budget
EACL 2021
Answer Generation for Retrieval-based Question Answering Systems
IJCNLP 2021
The Cascade Transformer: an Application for Efficient Answer Sentence Selection
ACL 2020
Multi-task Learning of Spoken Language Understanding by Integrating N-Best Hypotheses with Hierarchical Attention
COLING 2020
SMHD: a Large-Scale Resource for Exploring Online Language Usage for Multiple Mental Health Conditions
COLING 2018
RSDD-Time: Temporal Annotation of Self-Reported Mental Health Diagnoses
NAACL 2018
Helping or Hurting? Predicting Changes in Users' Risk of Self-Harm Through Online Community Interactions
NAACL 2018
GU IRLAB at SemEval-2018 Task 7: Tree-LSTMs for Scientific Relation Classification
SEMEVAL 2018
Matching Citation Text and Cited Spans in Biomedical Literature: a Search-Oriented Approach
NAACL 2015