Methodology

How the data is collected, classified, and presented.

1. Data sources

Papers come from the official proceedings pages of each conference.

Scrapers live in the researchpooler Python project. They fetch title, authors, year, venue, and PDF URL. Abstracts are extracted from HTML pages where available (~84% of papers).

2. Topic classification

Each paper with an abstract is classified into a 3-level taxonomy by an LLM with chain-of-thought reasoning. The pipeline runs in two stages:

  1. Pre-classification (PostgreSQL, zero LLM cost): the title and abstract are matched against 17k known keywords. The matched keywords vote for candidate top-level domains, which narrows the taxonomy from 290 nodes to the 2-3 most relevant branches.
  2. LLM classification: the LLM (minimax-m2.5-free via OpenCode CLI) sees only the pruned sub-taxonomy and the full abstract. It returns selected topic paths, normalised keywords, and a one-sentence reasoning citing words from the paper.

This two-stage design reduces token usage by 67-83% and improves accuracy (~98% on validated samples), because the LLM no longer has to choose among all 290 options.
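
The keyword-voting step can be illustrated with a minimal Python sketch. It is not the project's actual implementation, which runs in the database: the keyword map, the function name candidate_domains, and the naive substring matching are all assumptions made for illustration.

```python
from collections import Counter

# Hypothetical keyword -> top-level domain map (illustrative entries only;
# the real keyword table holds about 17k keywords).
KEYWORD_DOMAINS = {
    "in-context learning": "ML",
    "transformer": "ML",
    "object detection": "Computer Vision",
    "machine translation": "NLP",
    "language model": "NLP",
}


def candidate_domains(title, abstract, keyword_domains=KEYWORD_DOMAINS, top_n=3):
    """Return up to top_n domains, ranked by how many known keywords matched."""
    text = f"{title} {abstract}".lower()
    votes = Counter()
    for keyword, domain in keyword_domains.items():
        # Assumption: plain substring matching; the real matcher may differ.
        if keyword in text:
            votes[domain] += 1
    # Each matched keyword is one vote; keep only the strongest branches.
    return [domain for domain, _ in votes.most_common(top_n)]
```

Only the branches returned by such a function would then be shown to the LLM, which is where the token savings come from.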

3. Taxonomy evolution

The taxonomy is not static. When the LLM repeatedly proposes the same topic that does not yet exist in the taxonomy (e.g. "ML > Learning Types > In-Context Learning"), the topic is added automatically once it covers ≥5 papers (rake classify:evolve_taxonomy).
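
The promotion rule can be sketched in a few lines of Python. This is an illustration only; the real logic lives in the rake task, and the function name topics_to_add and the (paper_id, path) input shape are assumptions.

```python
from collections import defaultdict

MIN_PAPERS = 5  # threshold stated in the text above


def topics_to_add(proposals, existing_paths, min_papers=MIN_PAPERS):
    """Return proposed topic paths that are new and cover enough papers.

    proposals: iterable of (paper_id, proposed_path) pairs (hypothetical shape).
    existing_paths: set of topic paths already in the taxonomy.
    """
    papers_by_path = defaultdict(set)
    for paper_id, path in proposals:
        if path not in existing_paths:
            # A set ensures each paper counts once per proposed topic.
            papers_by_path[path].add(paper_id)
    return sorted(
        path for path, ids in papers_by_path.items() if len(ids) >= min_papers
    )
```

Counting distinct papers rather than raw proposals keeps a topic from being promoted because one paper was classified several times.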

4. Achievements and trends

Author achievements (15 badges, see /achievements) are derived from materialized views over the papers/authors/topics graph. Rarity tiers are computed from the real distribution (legendary < 0.1%, epic < 1%, rare < 5%, uncommon < 10%, common ≥ 10%).
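
As a rough illustration, the tier thresholds map to a simple lookup. The sketch below assumes the percentage is the share of all authors who hold a badge, and that a share of exactly 10% counts as common; the function name rarity_tier is hypothetical.

```python
# Upper bounds in per-mille (tenths of a percent) of authors holding the badge.
TIERS = [
    (1, "legendary"),   # < 0.1%
    (10, "epic"),       # < 1%
    (50, "rare"),       # < 5%
    (100, "uncommon"),  # < 10%
]


def rarity_tier(holders, total_authors):
    """Map a badge's holder count to a rarity tier (assumed interpretation)."""
    if total_authors <= 0:
        raise ValueError("total_authors must be positive")
    for per_mille, name in TIERS:
        # Integer comparison avoids floating-point edge cases at the boundaries.
        if holders * 1000 < per_mille * total_authors:
            return name
    return "common"
```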

Emerging terms (/trends) come from n-grams in paper titles. We tokenise all titles, count occurrences per year, and surface terms with at least 5 occurrences across at least 2 years and a positive growth rate.
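
A minimal Python sketch of this filter follows. The tokeniser, the bigram default, and especially the growth-rate definition (relative change between the first and last year in which a term appears) are assumptions; the text does not specify how growth is measured.

```python
import re
from collections import Counter, defaultdict


def emerging_terms(titles_by_year, n=2, min_count=5, min_years=2):
    """Return {term: growth} for n-grams that pass the thresholds.

    titles_by_year: dict mapping year -> list of paper titles (hypothetical shape).
    """
    counts = defaultdict(Counter)  # term -> Counter of year -> occurrences
    for year, titles in titles_by_year.items():
        for title in titles:
            tokens = re.findall(r"[a-z0-9]+", title.lower())
            for i in range(len(tokens) - n + 1):
                counts[" ".join(tokens[i:i + n])][year] += 1

    result = {}
    for term, per_year in counts.items():
        years = sorted(per_year)
        if sum(per_year.values()) < min_count or len(years) < min_years:
            continue
        first, last = per_year[years[0]], per_year[years[-1]]
        # Assumed growth measure: relative change from first to last year seen.
        growth = (last - first) / first
        if growth > 0:
            result[term] = growth
    return result
```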

5. Co-occurrence graph (Neo4j)

17k keywords + 290 topics + 220k papers form a knowledge graph in Neo4j. Edges:

The /explore graph view runs Cypher queries against this graph for 8 different visualisation presets.

6. Limitations

7. Open source

Everything is on GitHub: research-explorer (Rails 8 + PostgreSQL + Neo4j) and researchpooler (Python scrapers + LLM classification).