Elixir/HTML dump scraper: Difference between revisions

Adamw (talk | contribs)
Improve introduction
Adamw (talk | contribs)
Explain references more
<references group="example-footnote" />


=== Why are we counting references? ===
The Wikimedia Germany Technical Wishes team has spent the past few months [[metawikimedia:WMDE_Technical_Wishes/Reusing_references|focusing on how references are reused]] on wikis.  We have some ideas about what needs to be fixed, although unfortunately this project is currently on hold.  Before changing anything, we needed to measure the baseline situation, both to better understand how references are used and reused and to be able to tell whether our interventions are beneficial.

Previous [[metawikimedia:Research:Characterizing_Wikipedia_Citation_Usage|research into citations]] also measured HTML-formatted articles, but HTML dumps weren't available at the time, so this was accomplished by downloading articles one at a time.

Readers interested in the results can skip ahead to the [https://phabricator.wikimedia.org/T332032#9011167 raw summary results]; much more detail and analysis will be published in the future.

=== Challenging to find references in wikitext ===
A raw reference is straightforward in wikitext and looks like: <code><nowiki><ref>This footnote.</ref></nowiki></code>.  If this were the end of the story, it would be simple to parse references.  What makes it more complicated is that many references are produced using reusable templates, for example: <code><nowiki>{{sfn|Hacker|Grimwood|2011|p=290}}</nowiki></code>.


If "<nowiki>{{sfn}}</nowiki>" were the only such template, we could simply search the wikitext for "ref" tags and "sfn" templates.  But a [https://en.wikipedia.org/w/index.php?search=insource%3A%3Cref%3E+insource%3A%2F%5B%3C%5Dref%5B%3E%5D%2F+-insource%3A%2Finclude+%5B%3C%5Dref%5B%3E%5D%2F&title=Special:Search&profile=advanced&fulltext=1&ns10=1 search for reference-producing templates] turns up over 12,000 different templates on English Wikipedia alone, and the set will be different on every wiki and language edition.
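
To make the limitation concrete, here is a minimal Python sketch of the naive wikitext approach: it counts only "ref" tags and "sfn" templates.  The function name, the regular expressions, and the template name <code>some-ref-template</code> in the usage note are illustrative assumptions, not part of the actual scraper.

```python
import re

# Opening <ref> tags, with or without attributes (this includes self-closing
# reuses such as <ref name="a" />). <references /> is deliberately not matched.
REF_TAG = re.compile(r"<ref(?:\s[^>]*)?>", re.IGNORECASE)

# The start of an {{sfn|...}} template call.
SFN_TEMPLATE = re.compile(r"\{\{\s*sfn\s*\|", re.IGNORECASE)


def count_naive_references(wikitext: str) -> int:
    """Count references by looking only for <ref> tags and {{sfn}} calls.

    Any other reference-producing template is silently missed, which is
    exactly the weakness of searching in wikitext.
    """
    return len(REF_TAG.findall(wikitext)) + len(SFN_TEMPLATE.findall(wikitext))
```

Applied to wikitext containing one raw <code><nowiki><ref></nowiki></code>, one <code><nowiki>{{sfn}}</nowiki></code> call, and one call to a hypothetical <code><nowiki>{{some-ref-template}}</nowiki></code>, this sketch would count two references and overlook the third.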


=== References are simple in HTML ===
Once the wikitext is fully rendered to HTML, we can finally see all of the footnotes that were produced.  They look something like this: <code><nowiki><div typeof="mw:Extension/ref">Footnote text.</div></nowiki></code>.
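
Counting footnotes in rendered HTML can be sketched with Python's standard-library parser.  This is only an illustration under stated assumptions: the class and function names are made up here, it counts any element whose <code>typeof</code> attribute contains the token <code>mw:Extension/ref</code>, and the markup in real dumps may differ from the simplified example above.

```python
from html.parser import HTMLParser


class RefCounter(HTMLParser):
    """Counts elements whose typeof attribute includes mw:Extension/ref."""

    def __init__(self):
        super().__init__()
        self.count = 0

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            # typeof may hold several space-separated tokens, so compare whole
            # tokens; this also keeps mw:Extension/references from matching.
            if name == "typeof" and value and "mw:Extension/ref" in value.split():
                self.count += 1


def count_references(html: str) -> int:
    """Return the number of rendered footnotes found in an HTML string."""
    parser = RefCounter()
    parser.feed(html)
    parser.close()
    return parser.count
```

Because the check no longer depends on which template produced the footnote, the same function would apply unchanged to any wiki or language edition, provided the rendered markup follows this pattern.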