Elixir/HTML dump scraper

[[File:1.Magukbaforduló ikerszelvényesek 72dpi.jpg|thumb]]
A new and wondrous data source has become available to Wikimedia researchers and hobbyists: semantic [https://dumps.wikimedia.org/other/enterprise_html/ HTML dumps] of all articles on wiki.  Previously, archived wiki content was only available as raw wikitext, which is notoriously difficult to parse for information, or even to render.  MediaWiki's wikitext depends on a number of extensions and is usually tied to a specific site's configuration and user-generated templates.  This recursively parsed content is effectively impossible to expand exactly as it appeared at the time it was written, once the templates and software it depends on have drifted.


HTML dumps are an improvement in every way: content, structure and information are expanded, frozen and made available in a form that can be read by ordinary tools—and the original wikitext is still carried along as RDFa annotations, which makes the new format something like a superset.
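To give a sense of how approachable the dumps are with ordinary tools, here is a minimal sketch in Elixir using [https://hex.pm/packages/floki Floki] to pull the Parsoid annotations out of one article's HTML. The <code>data-mw</code> attribute (where Parsoid keeps the original wikitext of templates and extension tags as JSON) and the file name are assumptions on my part; check them against a real dump before relying on them.

<syntaxhighlight lang="elixir">
# Minimal sketch: read one article's HTML and collect its Parsoid annotations.
# Floki is a generic Elixir HTML parser; data-mw is assumed to hold the
# original wikitext of templates and extension tags as JSON.
Mix.install([:floki, :jason])

html = File.read!("article.html")
{:ok, document} = Floki.parse_document(html)

annotations =
  document
  |> Floki.find("[data-mw]")
  |> Enum.map(fn node ->
    [node]
    |> Floki.attribute("data-mw")
    |> List.first()
    |> Jason.decode!()
  end)

IO.inspect(Enum.take(annotations, 3))
</syntaxhighlight>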


At my day job doing [[metawikimedia:WMDE_Technical_Wishes|Technical Wishes]] for [https://www.wikimedia.de/ Wikimedia Germany], we found a reason to dive into these new dumps: it's the only reliable way to count the footnotes on each article.  I'll go into some detail about why other data sources wouldn't have sufficed, and also why we're counting footnotes in the first place.
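As a taste of what the scraper has to do, here is a rough sketch of counting the footnotes in a single article's HTML. It assumes (and this is an assumption worth verifying against a real dump) that Parsoid marks each footnote with a <code>sup</code> element whose <code>typeof</code> attribute mentions <code>mw:Extension/ref</code>.

<syntaxhighlight lang="elixir">
# Rough sketch of the per-article footnote count. The typeof check encodes my
# assumption about Parsoid's footnote markup; verify it before trusting the
# numbers.
defmodule FootnoteCount do
  def count(html) when is_binary(html) do
    {:ok, document} = Floki.parse_document(html)

    document
    |> Floki.find("sup")
    |> Enum.filter(fn node ->
      [node]
      |> Floki.attribute("typeof")
      |> Enum.any?(&String.contains?(&1, "mw:Extension/ref"))
    end)
    |> length()
  end
end

# Example usage:
# FootnoteCount.count(File.read!("article.html"))
</syntaxhighlight>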


{{Project|status=(in progress)|url=https://gitlab.com/wmde/technical-wishes/scrape-wiki-html-dump}}