Elixir/HTML dump scraper: Difference between revisions

Revision as of 08:50, 25 March 2023

A new and wondrous data source has become available to Wikimedia researchers and hobbyists: semantic HTML dumps of articles for most wikis. Previously, only the bare wikitext was available for download, and it was notoriously difficult to make sense of. With the HTML dumps, standard HTML tooling can be used to extract many types of structure and information, and the original wikitext is still present as annotations.

At my day job working for Wikimedia Germany's Technical Wishes, we found a concrete motivation to parse these dumps: they are the only reliable way to tally how reference footnotes are used. I'll go into some detail about why other data sources wouldn't be sufficient, because the comparison showcases both the challenges of wikitext and the benefits of HTML dumps.
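
To give a feel for what "standard tooling" can mean here, the following is a minimal sketch in Python using only the standard library. It is not the project's code (the scraper itself is written in Elixir), and it rests on an assumption: that each footnote marker in the dump HTML is an element whose `typeof` attribute includes `mw:Extension/ref`. The names `RefCounter`, `count_refs` and the sample markup are made up for illustration.

```python
from html.parser import HTMLParser


class RefCounter(HTMLParser):
    """Counts elements whose typeof attribute marks a <ref> footnote."""

    def __init__(self):
        super().__init__()
        self.ref_count = 0

    def handle_starttag(self, tag, attrs):
        # typeof may hold several space-separated values, so split it.
        # Assumption: footnote markers carry the value "mw:Extension/ref".
        typeof = dict(attrs).get("typeof") or ""
        if "mw:Extension/ref" in typeof.split():
            self.ref_count += 1


def count_refs(html: str) -> int:
    """Return the number of footnote markers found in an HTML string."""
    parser = RefCounter()
    parser.feed(html)
    parser.close()
    return parser.ref_count


# Hypothetical fragment shaped like an article paragraph with two footnotes.
sample = (
    '<p>Water is wet.'
    '<sup typeof="mw:Extension/ref" class="mw-ref reference">[1]</sup>'
    ' It is also clear.'
    '<sup typeof="mw:Extension/ref" class="mw-ref reference">[2]</sup></p>'
)
print(count_refs(sample))
```

The point of the sketch is that no wikitext parsing is involved: the footnotes are found by looking at ordinary HTML attributes, which any HTML parser can do.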

Project source code (in progress): https://gitlab.com/wmde/technical-wishes/scrape-wiki-html-dump

Revision as of 08:24, 25 March 2023 by Adamw (edit summary: "Start a draft about the scraper", tag: Visual edit), compared with revision as of 08:50, 25 March 2023 by Adamw (no edit summary).
Line 1:
- [[File:1.Magukbaforduló ikerszelvényesek 72dpi.jpg|thumb]]
+ [[File:1.Magukbaforduló ikerszelvényesek 72dpi.jpg]]