Elixir/HTML dump scraper

[[File:1.Magukbaforduló ikerszelvényesek 72dpi.jpg|thumb]]
{{Project|status=(in progress)|url=https://gitlab.com/wmde/technical-wishes/scrape-wiki-html-dump}}
A new and wondrous data source has become available to Wikimedia researchers and hobbyists: semantic [https://dumps.wikimedia.org/other/enterprise_html/ HTML dumps] of all articles on wiki.  Previously, archived wiki content was only available as raw wikitext, which is notoriously difficult to parse for information, or even to render.  MediaWiki's wikitext depends on a number of extensions and is usually tied to a specific site's configuration and user-generated templates.  This recursively parsed content is effectively impossible to expand exactly as it appeared at the time it was written, once the templates and software it depends on have drifted.
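To make the idea concrete, here is a minimal sketch of consuming such a dump. It assumes the Enterprise HTML dump layout as I understand it — NDJSON, one JSON object per article, with the rendered page under an `article_body.html` field — none of which is spelled out in this post, so treat the field names as assumptions to check against the dump documentation.

```python
import json


def iter_articles(lines):
    """Yield (title, html) pairs from NDJSON lines of an HTML dump.

    Assumes each line is a JSON object with a "name" field and the
    parsed HTML under article_body.html (my reading of the Enterprise
    dump schema; verify against the actual files).
    """
    for line in lines:
        record = json.loads(line)
        yield record["name"], record["article_body"]["html"]


# Hypothetical single-record dump line for illustration:
sample_line = json.dumps({
    "name": "Example article",
    "article_body": {"html": "<html><body><p>Hello</p></body></html>"},
})

for title, html in iter_articles([sample_line]):
    print(title, len(html))
```

In practice the dumps ship as large compressed bundles, so you would stream lines from the archive rather than load a file into memory; the generator shape above keeps that option open.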




At my day job doing [[metawikimedia:WMDE_Technical_Wishes|Technical Wishes]] for [https://www.wikimedia.de/ Wikimedia Germany], we found a reason to dive into these new dumps: it's the only reliable way to count the footnotes on each article.  I'll go into some detail about why other data sources wouldn't have sufficed, and also why we're counting footnotes in the first place.


== Reference parsing ==