Methodology
How the data is collected, classified, and presented.
1. Data sources
Papers come from official proceedings pages for each conference:
- OpenReview — ICLR (2018-)
- PMLR (proceedings.mlr.press) — ICML, AISTATS, CoLT, CoRL, UAI, ACML, L4DC, MIDL, MLHC, PGM, CLeaR, ALT, AutoML
- ACL Anthology — ACL, EMNLP, NAACL, COLING, EACL, AACL, IJCNLP, CoNLL, SemEval
- CVF Open Access — CVPR, ICCV, WACV
- ECVA — ECCV
- NeurIPS Proceedings — NeurIPS (2006-)
- plus AAAI, IJCAI, JMLR, MICCAI, Interspeech, USENIX, RSS
Scrapers live in the researchpooler Python project. They fetch title, authors, year, venue, and PDF URL. Abstracts are extracted from HTML pages where available (~84% of papers).
2. Topic classification
Each paper with an abstract is classified into a 3-level taxonomy by an LLM with chain-of-thought reasoning. The pipeline runs in two stages:
- Pre-classification (in PostgreSQL, zero LLM cost): the title and abstract are matched against 17k known keywords. The matched keywords vote for candidate top-level domains, which narrows the taxonomy from 290 nodes down to 2-3 relevant branches.
- LLM classification: the LLM (minimax-m2.5-free via OpenCode CLI) sees only the pruned sub-taxonomy and the full abstract. It returns selected topic paths, normalised keywords, and a one-sentence reasoning citing words from the paper.
This two-stage design cuts token usage by 67-83% and improves accuracy (~98% on validated samples), because the model chooses among a few relevant branches instead of all 290 nodes.
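The keyword-voting step can be illustrated with a minimal Python sketch. The keyword table, the function names, and the naive substring matching are assumptions made for illustration; the production step runs against the full keyword set rather than an in-memory dictionary.

```python
from collections import Counter

# Hypothetical keyword -> top-level domain table (illustrative only;
# the real table holds roughly 17k keywords).
KEYWORD_DOMAINS = {
    "named entity recognition": "NLP",
    "machine translation": "NLP",
    "object detection": "Computer Vision",
    "image segmentation": "Computer Vision",
    "policy gradient": "Reinforcement Learning",
    "transformer": "ML",
}


def candidate_domains(title, abstract, max_branches=3):
    """Return up to max_branches top-level domains, ranked by keyword votes."""
    text = f"{title} {abstract}".lower()
    # Each keyword found in the text casts one vote for its domain.
    votes = Counter(
        domain for keyword, domain in KEYWORD_DOMAINS.items() if keyword in text
    )
    return [domain for domain, _ in votes.most_common(max_branches)]


def prune_taxonomy(taxonomy, domains):
    """Keep only the branches whose top-level domain received votes."""
    return {domain: taxonomy[domain] for domain in domains if domain in taxonomy}
```

Only the pruned dictionary returned by `prune_taxonomy` would then be serialised into the LLM prompt, which is where the token savings come from.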
3. Taxonomy evolution
The taxonomy is not static. When the LLM consistently proposes the same new topic that doesn't yet exist (e.g. "ML > Learning Types > In-Context Learning"), the topic is added automatically once it covers ≥5 papers (rake classify:evolve_taxonomy).
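As a rough sketch of the promotion rule, assuming proposals are available as (paper id, proposed topic path) pairs; the function name and data shapes are hypothetical and do not mirror the actual rake task:

```python
from collections import defaultdict

MIN_PAPERS = 5  # threshold stated in the text


def topics_to_add(proposals, existing_topics, min_papers=MIN_PAPERS):
    """Return proposed topic paths that are new and cover enough papers.

    proposals: iterable of (paper_id, proposed_topic_path) pairs.
    existing_topics: set of topic paths already in the taxonomy.
    """
    papers_by_topic = defaultdict(set)
    for paper_id, path in proposals:
        if path not in existing_topics:
            # A set counts each paper once, even if it is proposed twice.
            papers_by_topic[path].add(paper_id)
    return sorted(
        path for path, papers in papers_by_topic.items() if len(papers) >= min_papers
    )
```

Counting distinct papers rather than raw proposals keeps a single re-classified paper from pushing a topic over the threshold.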
4. Achievements and trends
Author achievements (15 badges, see /achievements) are derived from materialized views over the papers/authors/topics graph. Rarity tiers are computed from the real distribution of badge holders (legendary < 0.1%, epic < 1%, rare < 5%, uncommon < 10%, common ≥ 10%).
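The tier thresholds can be expressed as a small lookup. This is a sketch only: the function name is hypothetical, and treating a share of exactly 10% as common is an assumption.

```python
# (exclusive upper bound on the share of authors holding the badge, tier name)
TIERS = [
    (0.001, "legendary"),
    (0.01, "epic"),
    (0.05, "rare"),
    (0.10, "uncommon"),
]


def rarity_tier(holders, total_authors):
    """Map the share of authors holding a badge to a rarity tier."""
    if total_authors <= 0:
        raise ValueError("total_authors must be positive")
    share = holders / total_authors
    for upper_bound, tier in TIERS:
        if share < upper_bound:
            return tier
    # Assumption: a share of exactly 10% or more counts as common.
    return "common"
```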
Emerging terms (/trends) come from n-grams in paper titles. We tokenise all titles, count occurrences per year, and surface terms that have at least 5 occurrences spread over at least 2 years and a positive growth rate.
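A minimal sketch of this filter, assuming bigrams, a simple alphanumeric tokeniser, and growth measured as the relative change between the first and last year in which a term appears. None of these three choices are specified above, so treat them as placeholders.

```python
import re
from collections import Counter, defaultdict


def ngrams(title, n):
    """Lower-case a title, split on non-alphanumerics, and return its n-grams."""
    tokens = re.findall(r"[a-z0-9]+", title.lower())
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def emerging_terms(papers, n=2, min_count=5, min_years=2):
    """papers: iterable of (year, title) pairs. Returns {term: growth_rate}."""
    counts = defaultdict(Counter)  # term -> {year: occurrences}
    for year, title in papers:
        for term in ngrams(title, n):
            counts[term][year] += 1

    result = {}
    for term, by_year in counts.items():
        if sum(by_year.values()) < min_count or len(by_year) < min_years:
            continue
        years = sorted(by_year)
        first, last = by_year[years[0]], by_year[years[-1]]
        growth = (last - first) / first  # assumed definition of growth rate
        if growth > 0:
            result[term] = growth
    return result
```

A term that is frequent but flat or declining is dropped by the final check, so long-established phrases do not crowd out new ones.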
5. Co-occurrence graph (Neo4j)
17k keywords + 290 topics + 220k papers form a knowledge graph in Neo4j. Edges:
- CO_OCCURS: two keywords share at least 2 papers (configurable minimum weight)
- SPECIALIZATION_OF: a keyword contains another, more general keyword as a substring (e.g. "convolutional neural network" → "neural network")
- ABBREVIATION_OF: abbreviation → full form (e.g. "NER" → "named entity recognition")
The /explore graph view runs Cypher queries against this graph for 8 different visualisation presets.
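The first two edge rules can be sketched in plain Python before anything is loaded into Neo4j. The function names and input shapes below are hypothetical, and the plain substring test is an assumption that may be stricter in the real pipeline.

```python
from collections import Counter
from itertools import combinations


def co_occurs_edges(paper_keywords, min_weight=2):
    """paper_keywords: {paper_id: set of keywords}.

    Returns {(kw_a, kw_b): weight} for pairs sharing at least min_weight papers.
    """
    weights = Counter()
    for keywords in paper_keywords.values():
        # Sorting makes each unordered pair map to a single dictionary key.
        for pair in combinations(sorted(keywords), 2):
            weights[pair] += 1
    return {pair: w for pair, w in weights.items() if w >= min_weight}


def specialization_edges(keywords):
    """Return (specific, general) pairs where general is a substring of specific."""
    return sorted(
        (specific, general)
        for specific in keywords
        for general in keywords
        if specific != general and general in specific
    )
```

The pairwise comparison in `specialization_edges` is quadratic in the number of keywords, which is acceptable for a sketch but would likely need an index at the 17k-keyword scale.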
6. Limitations
- Sample bias: only 36 top venues. We miss workshops, journals, preprints, regional conferences.
- Abstract coverage: ~16% of papers (mostly pre-2015) have no abstract. They are stored but excluded from public search.
- Title-only classification: accuracy drops from ~98% to ~85%. Papers without an abstract are flagged but not pre-classified.
- LLM dependence: classification is reproducible (deterministic prompts, fixed model), but an individual paper's result cannot be audited without re-running the classification.
7. Open source
Everything is on GitHub: research-explorer (Rails 8 + PostgreSQL + Neo4j) and researchpooler (Python scrapers + LLM classification).