Pedro Ortiz Suarez
13 papers · 2022–2026 · 6 conferences · across top CS/AI conferences
Achievements
Taxonomy Completionist (28)
Interdisciplinary Bridge
Conference Polyglot (5)
Renaissance Researcher (7)
Keyword Pioneer
Hot Topic Early Bird
Mega-Team (54)
Keyword Champion
Prolific Year (5)
Century Club (11)
Keyword Collector (60)
The Questioner
Conferences
EMNLP (4)
ACL (3)
COLING (2)
NAACL (2)
EACL (1)
NeurIPS (1)
Research topics
Keywords
large language model (5)
language identification (3)
multilingual corpus (3)
multimodal learning (2)
multilingual dataset (2)
text classification (2)
web data (2)
low-resource language (2)
in-context learning (1)
multilingual nlp (1)
transfer learning (1)
corpus construction (1)
responsible ai (1)
corpus linguistics (1)
language model training (1)
machine translation (1)
prior probability (1)
corpus creation (1)
named entity recognition (1)
instruction tuning (1)
Papers
How Should We Model the Probability of a Language?
EACL 2026
CommonLID: Re-evaluating State-of-the-Art Language Identification Performance on Web Data
ACL 2026
Identifying Rare Languages in Common Crawl Data is a Needles-in-a-Haystack Problem
EMNLP 2025
mOSCAR: A Large-scale Multilingual and Multimodal Document-level Corpus
ACL 2025
Building Data Infrastructure for Low-Resource Languages
NAACL 2025
Table Understanding and (Multimodal) LLMs: A Cross-Domain Case Study on Scientific vs. Non-Scientific Data
ACL 2025
Molyé: A Corpus-based Approach to Language Contact in Colonial France
EMNLP 2024
A CURATEd CATalog: Rethinking the Extraction of Pretraining Corpora for Mid-Resourced Languages
COLING 2024
Community OSCAR: A Community Effort for Multilingual Web Data
EMNLP 2024
Occiglot at WMT24: European Open-source Large Language Models Evaluated on Translation
EMNLP 2024
Tokenizer Choice For LLM Training: Negligible or Crucial?
NAACL 2024
The BigScience ROOTS Corpus: A 1.6TB Composite Multilingual Dataset
NeurIPS 2022
A Data-driven Approach to Named Entity Recognition for Early Modern French
COLING 2022