Elixir/HTML dump scraper

Revision as of 16:47, 25 March 2023 by Adamw (some examples)

A new and wondrous data source has become available to Wikimedia researchers and hobbyists: semantic HTML dumps of articles for most wikis. Previously, only the bare wikitext was available for download, and it was notoriously difficult to make sense of. With the HTML dumps, standard tooling can be used to extract many types of structure and information, and the wikitext is still present as annotations.

At my day job working for Wikimedia Germany's Technical Wishes, we found a motivation to parse these dumps: doing so is the only reliable way to tally how reference footnotes are used. I'll go into some detail about why other data sources wouldn't be sufficient, because it showcases the challenges of wikitext and the relative simplicity and dependability of HTML dumps.



What are references?

References are the little footnotes all over Wikipedia articles.[1] They ground the writing in sources, which is especially important on Wikipedia because of the rule against so-called "original research": everything needs to be paraphrased from existing secondary sources.

A raw reference looks like <ref>This footnote.</ref>. But most references are fancier and rely on reusable structures called templates. These can get long, but let's take a simple example: {{sfn|Burgess|2011|p=290}}.
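To make the two forms concrete, here is a minimal sketch of picking them out of wikitext with regular expressions, using only the Elixir standard library. The module name RefSketch and its functions are hypothetical and exist only for illustration; this is not the scraper's method. It is deliberately naive: it assumes a <ref> tag with no attributes and a template with no nested templates, so something like <ref name="x"> would be missed.

```elixir
defmodule RefSketch do
  # Hypothetical helper for illustration only, not part of the scraper.

  @doc "Returns the body text of each plain <ref>...</ref> tag."
  def raw_refs(wikitext) do
    # Assumption: no attributes on the tag and no nested <ref> tags.
    ~r{<ref>(.*?)</ref>}s
    |> Regex.scan(wikitext, capture: :all_but_first)
    |> List.flatten()
  end

  @doc "Splits each {{sfn|...}} call into positional and named parameters."
  def sfn_calls(wikitext) do
    # Assumption: no templates nested inside the parameters.
    ~r/\{\{sfn\|([^{}]*)\}\}/
    |> Regex.scan(wikitext, capture: :all_but_first)
    |> Enum.map(fn [params] -> parse_params(params) end)
  end

  # "Burgess|2011|p=290" becomes two positional values and one named value.
  defp parse_params(params) do
    params
    |> String.split("|")
    |> Enum.reduce(%{positional: [], named: %{}}, fn param, acc ->
      case String.split(param, "=", parts: 2) do
        [name, value] ->
          put_in(acc, [:named, String.trim(name)], String.trim(value))

        [value] ->
          update_in(acc, [:positional], &(&1 ++ [String.trim(value)]))
      end
    end)
  end
end
```

Applied to the two examples above, RefSketch.raw_refs/1 returns ["This footnote."] and RefSketch.sfn_calls/1 returns [%{positional: ["Burgess", "2011"], named: %{"p" => "290"}}].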

Challenges of wikitext

  1. Like this one.