Elixir/HTML dump scraper: Difference between revisions

Adamw (talk | contribs)
Start a draft about the scraper
 
Adamw (talk | contribs)
No edit summary
Line 1: Line 1:
[[File:1.Magukbaforduló ikerszelvényesek 72dpi.jpg|thumb]]
[[File:1.Magukbaforduló ikerszelvényesek 72dpi.jpg]]
A new and wondrous data source has become available to Wikimedia researchers and hobbyists: semantic HTML dumps of articles for most wikis.  Previously, only the bare wikitext was available for download and this was notoriously difficult to make sense of.  With the HTML dumps, standard tooling is used to extract many types of structure and information—and the wikitext is still present as annotations.
A new and wondrous data source has become available to Wikimedia researchers and hobbyists: semantic HTML dumps of articles for most wikis.  Previously, only the bare wikitext was available for download and this was notoriously difficult to make sense of.  With the HTML dumps, standard tooling is used to extract many types of structure and information—and the wikitext is still present as annotations.