Elixir/HTML dump scraper

[[File:1.Magukbaforduló ikerszelvényesek 72dpi.jpg|thumb]]
A new and wondrous data source has become available to Wikimedia researchers and hobbyists: semantic [https://dumps.wikimedia.org/other/enterprise_html/ HTML dumps] of articles for most wikis.  Previously, only the raw wikitext was available for download and this format is notoriously difficult to make sense of because parsing depends on many layers of templates.  However, with the HTML dumps we can use standard tooling to extract structure and information—and the original wikitext is still available as RDFa annotations.


At my day job doing [[metawikimedia:WMDE_Technical_Wishes|Technical Wishes]] for [https://www.wikimedia.de/ Wikimedia Germany], we found a motivation to parse these new dumps: it's the only reliable way to count the footnotes.  I'll go into some detail about why other data sources wouldn't be sufficient, because it showcases the challenges of wikitext and the relative simplicity and dependability of HTML dumps.


{{Project|status=(in progress)|url=https://gitlab.com/wmde/technical-wishes/scrape-wiki-html-dump}}
== Reference parsing ==


=== What are references? ===
References are the little footnotes all over Wikipedia articles:<ref group="example-footnote">This is a footnote body.</ref>  These footnotes ground the writing in sources, and are a key aspect of the intellectual culture of Wikipedia since all claims are supposed to be paraphrased from existing secondary sources.


<references group="example-footnote" />


=== Challenges in wikitext ===
A raw reference is straightforward in wikitext and looks like: <code><nowiki><ref>This footnote.</ref></nowiki></code>.  If this were the end of the story, it would be simple to parse references.  What makes it more complicated is that many references are produced using reusable templates, for example: <code><nowiki>{{sfn|Hacker|Grimwood|2011|p=290}}</nowiki></code>.
 
If "<nowiki>{{sfn}}</nowiki>" were the only template then we could  search for "ref" tags and "sfn" templates in wikitext.  But a [https://en.wikipedia.org/w/index.php?search=insource%3A%3Cref%3E+insource%3A%2F%5B%3C%5Dref%5B%3E%5D%2F+-insource%3A%2Finclude+%5B%3C%5Dref%5B%3E%5D%2F&title=Special:Search&profile=advanced&fulltext=1&ns10=1 search for reference-producing templates] unveils over 12,000 different templates on English Wikipedia alone, and these will be different on every wiki and language edition.
 
=== Simplicity of HTML ===
Once the wikitext is fully rendered to HTML, we can finally see all of the footnotes that were produced.  They appear something like this: <code><nowiki><div typeof="mw:Extension/ref">Footnote text.</div></nowiki></code>.
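
A minimal sketch of pulling those nodes out in Elixir, assuming the [https://hexdocs.pm/floki/ Floki] HTML parser (the library choice and names here are an assumption, not necessarily what the project uses):

<syntaxhighlight lang="elixir">
# Parse a page of Parsoid HTML and keep every element whose typeof attribute
# marks it as output of the ref extension.
defmodule RefScraper do
  def extract_refs(html) do
    html
    |> Floki.parse_document!()
    |> Floki.find("[typeof]")
    |> Enum.filter(&ref_node?/1)
  end

  defp ref_node?({_tag, attrs, _children}) do
    case List.keyfind(attrs, "typeof", 0) do
      {"typeof", value} -> String.contains?(value, "mw:Extension/ref")
      nil -> false
    end
  end
end
</syntaxhighlight>

Counting the footnotes on a page is then just <code>length(RefScraper.extract_refs(html))</code>.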
 
Since the rendering is complete, we know exactly which references are visible, which is better than the ''potential'' references that we might have been able to determine from a static analysis of each template.
 
Template expansion is also reflected in the HTML's hierarchical structure, which makes it possible to tell when a reference was produced by a template or when a reference itself contains templates.  Both of these cases are interesting to our research.
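
Parsoid marks template output with <code>typeof="mw:Transclusion"</code>, so in principle both cases can be detected by looking for that marker on the ref node itself versus on its descendants.  The sketch below assumes that convention holds; the exact placement of the marker is worth verifying against real dump data.

<syntaxhighlight lang="elixir">
defmodule RefScraper.Classify do
  # The ref node itself carries the transclusion marker: it was produced by a template.
  def produced_by_template?(ref_node), do: transclusion?(ref_node)

  # Some element inside the ref body carries the marker: the ref contains a template.
  def contains_template?({_tag, _attrs, children}) do
    children
    |> Floki.find("[typeof]")
    |> Enum.any?(&transclusion?/1)
  end

  defp transclusion?({_tag, attrs, _children}) do
    case List.keyfind(attrs, "typeof", 0) do
      {"typeof", value} -> String.contains?(value, "mw:Transclusion")
      nil -> false
    end
  end
end
</syntaxhighlight>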
 
== Resumability ==
The scraping job is extremely slow—our first run took two months.  If the job crashes for any reason, it's crucial that we can resume from roughly the place where it stopped.
 
We've implemented two levels of resumability and idempotence:


The processing is broken down into small units which each write a single file.  If a file already exists, then we skip the corresponding calculation.  This general caching technique is known as [[w:Memoization|memoization]].
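
A minimal sketch of that pattern in Elixir (names are illustrative, not the project's API):

<syntaxhighlight lang="elixir">
# File-based memoization: only run the computation when its output file is missing.
# A production job might write to a temporary file and rename it, so that a crash
# mid-write cannot leave behind a truncated file that later gets skipped.
defmodule Memo do
  def memoize(path, compute_fun) do
    if File.exists?(path) do
      File.read!(path)
    else
      result = compute_fun.()
      File.write!(path, result)
      result
    end
  end
end
</syntaxhighlight>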
If "<nowiki>{{sfn}}</nowiki>" were the only template then we could just search for "ref" tags and "sfn" templates in wikitextHowever, a conservative [https://en.wikipedia.org/w/index.php?search=insource%3A%3Cref%3E+insource%3A%2F%5B%3C%5Dref%5B%3E%5D%2F+-insource%3A%2Finclude+%5B%3C%5Dref%5B%3E%5D%2F&title=Special:Search&profile=advanced&fulltext=1&ns10=1 search for reference-producing templates] shows over 12,000 on English Wikipedia alone, and not counting those that differ on every other wiki and language.


During each step of processing, we also write to a separate checkpoint file after every 100 articles.  Each output chunk is flushed immediately before the checkpoint file is updated, which keeps the window in which a crash can leave the two inconsistent small.  When resuming, if a checkpoint file is present then the job skips articles until it catches up with the number recorded there.  The overall behavior of the job is therefore "at least once" processing, meaning that the only irregularity that can occur is that some articles may be duplicated in the output.
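
A rough sketch of that checkpointing loop in Elixir, assuming articles arrive as a stream and the checkpoint file simply stores how many have been processed so far (again illustrative, not the project's actual code):

<syntaxhighlight lang="elixir">
defmodule Checkpoint do
  @chunk_size 100

  def run(articles, out_path, checkpoint_path, process_fun) do
    done = read_checkpoint(checkpoint_path)

    articles
    |> Stream.drop(done)                   # resume: skip articles already counted
    |> Stream.chunk_every(@chunk_size)
    |> Enum.reduce(done, fn chunk, count ->
      lines = Enum.map(chunk, process_fun)
      # Flush the output chunk, then record progress; a crash in between means
      # these articles are processed again on the next run ("at least once").
      File.write!(out_path, Enum.join(lines, "\n") <> "\n", [:append])
      new_count = count + length(chunk)
      File.write!(checkpoint_path, Integer.to_string(new_count))
      new_count
    end)
  end

  defp read_checkpoint(path) do
    case File.read(path) do
      {:ok, contents} -> String.to_integer(String.trim(contents))
      {:error, _} -> 0
    end
  end
end
</syntaxhighlight>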
