Elixir/HTML dump scraper

[[File:1.Magukbaforduló ikerszelvényesek 72dpi.jpg|thumb|''Introverted Millipede'']]
{{Project|status=(beta)|url=https://gitlab.com/wmde/technical-wishes/scrape-wiki-html-dump}}


A new and wondrous data source has become available to Wikimedia researchers and hobbyists: semantic [https://dumps.wikimedia.org/other/enterprise_html/ HTML dumps] of all articles on wiki.  Previously, archived wiki content was only available as raw wikitext, which is notoriously difficult to parse for information, or even to render.  MediaWiki's wikitext depends on a number of extensions and is usually tied to a specific site's configuration and user-generated templates.  This recursively parsed content is effectively impossible to expand exactly as it appeared at the time it was written, once the templates and software it depends on have drifted.


HTML dumps are an improvement in every way: content, structure and information are expanded, frozen and made available in a form that can be read by ordinary tools—and the original wikitext is still available as RDFa annotations which makes the new format something like a superset.


At my day job doing [[metawikimedia:WMDE_Technical_Wishes|Technical Wishes]] for [https://www.wikimedia.de/ Wikimedia Germany], we found a reason to dive into these new dumps: it's the only reliable way to count the footnotes on each article.  I'll go into some detail about why other data sources wouldn't have sufficed, and also why we're counting footnotes in the first place.


== Reference parsing ==


=== What are references? ===
References are the little footnotes all over Wikipedia articles:<ref group="example-footnote">This is a footnote body.  Often you will see a citation to a book or other source here.</ref>  These footnotes ground the writing in sources, and are a distinctive aspect of the Wikipedias' intellectual cultures.


<references group="example-footnote" />


=== Why are we counting references? ===
The Wikimedia Germany Technical Wishes team has taken the past few months to [[metawikimedia:WMDE_Technical_Wishes/Reusing_references|focus on how references are reused]] on wikis.  We have some ideas about what needs to be fixed (unfortunately this project is currently on hold), but first we needed to measure the baseline in order to better understand how references are used and reused, and to evaluate whether our potential interventions are beneficial.


Previous [[metawikimedia:Research:Characterizing_Wikipedia_Citation_Usage|research into citations]] has also measured references by starting with the HTML-formatted articles, but HTML dumps weren't available at the time so this was accomplished by downloading each article individually.


For those interested in the preliminary output of the scraper run, please skip ahead to the [https://phabricator.wikimedia.org/T332032#9011167 raw summary results].  More detailed statistics for each wiki page and template will be published once we figure out longer-term hosting for the data.


=== Obstacles to finding references in wikitext ===
A raw reference is straightforward in wikitext and looks like: <code><nowiki><ref>This footnote.</ref></nowiki></code>.  If this were the end of the story, it would be simple to parse references.  What makes it more complicated is that many references are produced using reusable templates, for example: <code><nowiki>{{sfn|Hacker|Grimwood|2011|p=290}}</nowiki></code>.


If "<nowiki>{{sfn}}</nowiki>" were the only template used to produce references, then we could search for "<nowiki><ref></nowiki>" tags and "<nowiki>{{sfn}}</nowiki>" templates in wikitext.  But a [https://en.wikipedia.org/w/index.php?search=insource%3A%3Cref%3E+insource%3A%2F%5B%3C%5Dref%5B%3E%5D%2F+-insource%3A%2Finclude+%5B%3C%5Dref%5B%3E%5D%2F&title=Special:Search&profile=advanced&fulltext=1&ns10=1 search for reference-producing templates] unveils over 12,000 such templates on English Wikipedia alone, and the set is different on every wiki and language edition.
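To make the obstacle concrete, here is a small sketch in Python (the scraper itself is Elixir, and the wikitext excerpt is invented) showing how a naive tag count misses template-produced references:

```python
import re

# Invented wikitext with two footnotes; only one uses a bare <ref> tag.
wikitext = (
    "The sky is blue.<ref>Visual inspection.</ref> "
    "The grass is green.{{sfn|Hacker|Grimwood|2011|p=290}}"
)

# A naive scan for <ref>...</ref> and self-closing <ref/> tags
# finds just one of the two footnotes.
ref_tags = re.findall(r"<ref[^>/]*>.*?</ref>|<ref[^>]*/>", wikitext, re.DOTALL)
print(len(ref_tags))  # 1

# Catching the second one would mean also recognizing {{sfn}} plus thousands
# of other reference-producing templates, a different set on every wiki.
```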


=== References are simple in HTML ===
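Parsoid-style HTML annotates every rendered footnote marker with <code>typeof="mw:Extension/ref"</code>, however it was produced in wikitext.  A minimal counter in Python (the HTML fragment below is invented, and the real scraper is Elixir) only needs to look for that one attribute:

```python
from html.parser import HTMLParser

# Invented fragment in the Parsoid style: each footnote marker carries
# typeof="mw:Extension/ref", whether it came from a <ref> tag or a template.
fragment = (
    '<p>The sky is blue.<sup typeof="mw:Extension/ref" class="mw-ref">'
    '<a href="#cite_note-1">[1]</a></sup> The grass is green.'
    '<sup typeof="mw:Extension/ref" class="mw-ref">'
    '<a href="#cite_note-2">[2]</a></sup></p>'
)

class RefCounter(HTMLParser):
    """Count elements whose typeof attribute marks a reference."""
    def __init__(self):
        super().__init__()
        self.count = 0

    def handle_starttag(self, tag, attrs):
        if dict(attrs).get("typeof") == "mw:Extension/ref":
            self.count += 1

counter = RefCounter()
counter.feed(fragment)
print(counter.count)  # 2 — both footnotes, template-generated or not
```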
The processing is broken down into small units which each write a single file.  If a file already exists, then we skip the corresponding calculation.  This general caching technique is known as [[w:Memoization|memoization]].
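A minimal sketch of the skip-if-the-file-exists scheme, in Python rather than the project's Elixir (all names are invented; the temp-file rename is one common way to keep half-written files from passing the existence check):

```python
import tempfile
from pathlib import Path

def memoized_step(out_path: Path, compute):
    """Run `compute` only if its output file doesn't already exist (sketch)."""
    if out_path.exists():
        return out_path  # cached result from an earlier, possibly interrupted run
    tmp = out_path.with_suffix(".tmp")
    tmp.write_text(compute())  # write under a temporary name first...
    tmp.rename(out_path)  # ...so a half-written file never passes the check above
    return out_path

workdir = Path(tempfile.mkdtemp())
calls = []
step = lambda: calls.append("ran") or '{"refs": 42}'
memoized_step(workdir / "counts.json", step)
memoized_step(workdir / "counts.json", step)  # skipped: the file already exists
print(len(calls))  # 1
```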


As each unit runs, we also write to a separate checkpoint file after every 100 articles processed.  The output file is flushed in chunks immediately before each checkpoint write, so the checkpoint never records work that hasn't been persisted, and the window in which a crash can leave the two files inconsistent stays small.  When resuming, if a checkpoint file is present then the job skips articles without processing them until the running count catches up with the count recorded in the checkpoint file.  The overall behavior of the job is therefore "at least once" processing: the only irregularity that can occur is duplicated work, with some articles appearing twice in the output.
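Sketched in Python (the real job is Elixir; the names and the small chunk size are invented for the demo), the checkpoint-and-resume loop might look like:

```python
import tempfile
from pathlib import Path

CHECKPOINT_EVERY = 5  # the real job checkpoints every 100 articles

def run(articles, out_path: Path, ckpt_path: Path):
    """At-least-once processing with checkpoint/resume (illustrative sketch)."""
    resume_at = int(ckpt_path.read_text()) if ckpt_path.exists() else 0
    with out_path.open("a") as out:
        for count, article in enumerate(articles, start=1):
            if count <= resume_at:
                continue  # fast-forward past work recorded by a previous run
            out.write(f"{article}\n")
            if count % CHECKPOINT_EVERY == 0:
                out.flush()  # persist the chunk before the checkpoint, so the
                # checkpoint never runs ahead of the output: a crash can only
                # duplicate work, never lose it
                ckpt_path.write_text(str(count))

d = Path(tempfile.mkdtemp())
titles = [f"Article {i}" for i in range(12)]
run(titles, d / "out.txt", d / "ckpt.txt")  # first run: 12 lines, last checkpoint at 10
run(titles, d / "out.txt", d / "ckpt.txt")  # resume: redoes only articles 11 and 12
lines = (d / "out.txt").read_text().splitlines()
print(len(lines))  # 14 — the two articles past the last checkpoint are duplicated
```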


== Concurrency ==


The next obvious fork point would be the phase which makes external API requests, but this is trickier because we also want to limit total concurrency across all wikis, to avoid overwhelming the service.  This should be implemented with a connection pool, ideally one which reuses a small number of persistent HTTP/1.1 connections.
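As a language-neutral sketch of that cap (in Python rather than Elixir; in the real service this role would fall to the connection pool), a counting semaphore bounds how many requests are in flight across all workers:

```python
import threading

MAX_IN_FLIGHT = 4  # invented cap; stands in for the connection-pool size
pool_slots = threading.Semaphore(MAX_IN_FLIGHT)
lock = threading.Lock()
in_flight = 0
peak = 0

def call_api(page_title):
    """Pretend API call; the semaphore caps concurrency across all wikis."""
    global in_flight, peak
    with pool_slots:  # blocks until one of the pooled slots is free
        with lock:
            in_flight += 1
            peak = max(peak, in_flight)
        # ... perform the HTTP request on a persistent connection here ...
        with lock:
            in_flight -= 1

threads = [threading.Thread(target=call_api, args=(f"Page {i}",)) for i in range(32)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(peak <= MAX_IN_FLIGHT)  # True: never more in flight than the pool allows
```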
== Modularity ==
It will be no surprise that the analyses are run as separate units under a pluggable architecture, so that the tool can be reused for various tasks.  The callbacks are crude for now and the abstraction is leaky, but it at least accomplishes code encapsulation and easily [https://gitlab.com/wmde/technical-wishes/scrape-wiki-html-dump/-/blob/main/config/config.exs configurable] composition.
Modules must be written in Elixir, but we're also considering a language-agnostic callback if the need arises.
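For illustration only, in Python rather than Elixir (every name below is invented), the callback shape might look like:

```python
# Each pluggable analysis is one callback taking a page's HTML and
# returning the metrics it contributes; configuration picks the set.

def count_refs(page_html):
    """One analysis: count Parsoid-style reference markers."""
    return {"refs": page_html.count('typeof="mw:Extension/ref"')}

def count_templates(page_html):
    """Another analysis: a crude proxy for template expansions."""
    return {"templates": page_html.count("data-mw")}

ENABLED = [count_refs, count_templates]  # which analyses run is configuration

def analyze(page_html):
    row = {}
    for plugin in ENABLED:
        row.update(plugin(page_html))  # leaky but simple: plugins share one row
    return row

print(analyze('<sup typeof="mw:Extension/ref"></sup>'))  # {'refs': 1, 'templates': 0}
```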
Some aspects of modularity were no fun, so we ignored them.  For example, [https://gitlab.com/wmde/technical-wishes/scrape-wiki-html-dump/-/blob/main/metrics.md each metric is documented] in one big flat file.  You'll also encounter some mild hardcoding, such as an entire extra processing phase to make external API requests for parsed map data.