Elixir/HTML dump scraper
Revision as of 18:21, 25 March 2023
A new and wondrous data source has become available to Wikimedia researchers and hobbyists: semantic HTML dumps of articles for most wikis. Previously, only the bare wikitext was available for download, and it was notoriously difficult to make sense of. With the HTML dumps, standard tooling can extract many types of structure and information, and the wikitext is still present as annotations.
At my day job working on Technical Wishes for Wikimedia Germany, we found a motivation to parse these dumps: it's the only reliable way to tally reference footnotes. I'll go into some detail about why other data sources wouldn't be sufficient, because it showcases the challenges of wikitext and the relative simplicity and dependability of HTML dumps.
Project link (in progress): https://gitlab.com/wmde/technical-wishes/scrape-wiki-html-dump
What are references?
References are the little footnotes all over Wikipedia articles:[example-footnote 1] Citations ground the writing in sources, which are especially important on Wikipedia because of the rule against so-called "original research". Factual claims are supposed to be paraphrased from existing secondary sources.
- [example-footnote 1] This is a footnote body.
A raw reference looks like <ref>This footnote.</ref>. Most references are fancier, and many rely on reusable structures called templates. Here's a short example of a template that would produce a footnote: {{sfn|Hacker|Grimwood|2011|p=290}}.
Challenges of wikitext
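If "{{sfn}}" were the only such template, we could just search the wikitext for "ref" tags and "sfn" templates. However, a conservative search for reference-producing templates (https://en.wikipedia.org/w/index.php?search=insource%3A%3Cref%3E+insource%3A%2F%5B%3C%5Dref%5B%3E%5D%2F+-insource%3A%2Finclude+%5B%3C%5Dref%5B%3E%5D%2F&title=Special:Search&profile=advanced&fulltext=1&ns10=1) finds over 12,000 on English Wikipedia alone, not counting the different templates used on other wikis and in other languages.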
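To make the problem concrete, here is a minimal sketch of the naive wikitext search described above. It is written in Python purely for illustration and is not taken from the project. The regular expressions and the template name "hypothetical-cite" are assumptions invented for this example.

```python
import re

# Opening <ref> tags, including <ref name="x"> and self-closing reuses,
# but not the <references /> list tag.
REF_TAG = re.compile(r"<ref(?:\s[^>]*)?/?>", re.IGNORECASE)
# The start of an {{sfn|...}} template call.
SFN_TEMPLATE = re.compile(r"\{\{\s*sfn\s*\|", re.IGNORECASE)


def naive_ref_count(wikitext: str) -> int:
    """Count footnotes by looking only for <ref> tags and {{sfn}} calls."""
    return len(REF_TAG.findall(wikitext)) + len(SFN_TEMPLATE.findall(wikitext))


# "hypothetical-cite" is a made-up stand-in for any of the thousands of
# other templates that expand to a <ref> when the page is rendered.
sample = (
    "Claim one.<ref>This footnote.</ref> "
    "Claim two.{{sfn|Hacker|Grimwood|2011|p=290}} "
    "Claim three.{{hypothetical-cite|Some source}}\n"
    "<references />"
)

print(naive_ref_count(sample))
```

If the made-up template really did expand to a footnote, the rendered page would show three footnotes, but the search only sees two. Closing that gap would mean maintaining a list of every reference-producing template on every wiki.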
Only the fully rendered HTML article shows the final footnotes.
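By contrast, counting footnotes in rendered HTML can be sketched with a standard parser. The example below is in Python for illustration and is not the project's own code. It assumes that each footnote marker is an element whose typeof attribute contains the token "mw:Extension/ref". That assumption about the dump markup, and the sample HTML, should be checked against real dump output.

```python
from html.parser import HTMLParser

# Assumed marker for a footnote in the HTML dump; verify against real data.
REF_TYPE = "mw:Extension/ref"


class RefCounter(HTMLParser):
    """Counts elements whose typeof attribute includes the ref marker."""

    def __init__(self):
        super().__init__()
        self.count = 0

    def handle_starttag(self, tag, attrs):
        # typeof may hold several space-separated tokens.
        typeof = dict(attrs).get("typeof") or ""
        if REF_TYPE in typeof.split():
            self.count += 1


def count_refs(html: str) -> int:
    parser = RefCounter()
    parser.feed(html)
    parser.close()
    return parser.count


# Hand-written sample, not copied from a real dump.
sample_html = (
    '<p>Claim one.<sup typeof="mw:Extension/ref">[1]</sup> '
    'Claim two.<sup typeof="mw:Transclusion mw:Extension/ref">[2]</sup></p>'
    '<ol typeof="mw:Extension/references"></ol>'
)

print(count_refs(sample_html))
```

Because the template has already been expanded by the time the HTML is produced, the same check covers hand-written and template-generated footnotes alike, with no per-wiki template list.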