Elixir/HTML dump scraper

[[File:1.Magukbaforduló ikerszelvényesek 72dpi.jpg|thumb]]
A new and wondrous data source has become available to Wikimedia researchers and hobbyists: semantic [https://dumps.wikimedia.org/other/enterprise_html/ HTML dumps] of articles for most wikis.  Previously, archived wiki content was only available as raw wikitext, which is notoriously difficult to parse for information, or even to render.  MediaWiki's wikitext depends on a number of extensions and is usually tied to a specific site's configuration and user-generated templates, so this recursively parsed content can never be expanded exactly as it was intended when written.


HTML dumps are an improvement in every way: content, structure and information are available in a form that can be read by ordinary tools—and the original wikitext is still available as RDFa annotations which makes the new format something like a superset.

At my day job doing [[metawikimedia:WMDE_Technical_Wishes|Technical Wishes]] for [https://www.wikimedia.de/ Wikimedia Germany], we found a reason to parse these new dumps: it's the only reliable way to count the footnotes on each article.  I'll go into some detail about why other data sources wouldn't have sufficed, and also why we're counting footnotes in the first place.
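
To give a sense of how approachable the new format is, here is a minimal sketch of that footnote count for a single article.  It assumes the [https://hexdocs.pm/floki/ Floki] HTML parser and assumes that Parsoid renders each footnote reference as a sup element carrying <code>typeof="mw:Extension/ref"</code>; the module name and selector are illustrative, so check them against the markup in an actual dump before relying on them.

<syntaxhighlight lang="elixir">
defmodule FootnoteCount do
  # Count footnote markers in one article's Parsoid HTML.
  # Assumption: footnote references appear as <sup typeof="mw:Extension/ref">.
  def count(html) when is_binary(html) do
    {:ok, document} = Floki.parse_document(html)

    document
    |> Floki.find("sup[typeof='mw:Extension/ref']")
    |> length()
  end
end
</syntaxhighlight>

Each record in an Enterprise dump file is a JSON object for one article, with the Parsoid HTML nested inside it (under an <code>article_body</code> field, last I checked), so a scraper would decode each line and run the function above against that field.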


{{Project|status=(in progress)|url=https://gitlab.com/wmde/technical-wishes/scrape-wiki-html-dump}}