Elixir/HTML dump scraper: Difference between revisions
No edit summary |
m →What are references?: play with language |
=== What are references? ===
References are the little footnotes all over Wikipedia articles:<ref group="example-footnote">This is a footnote body. Often you will see a citation to a book or other source here.</ref> These footnotes ground the writing in sources, and are a distinctive aspect of the Wikipedias' intellectual cultures.
<references group="example-footnote" />
=== Why are we counting references? ===
The Wikimedia Germany Technical Wishes team has taken the past few months to [[metawikimedia:WMDE_Technical_Wishes/Reusing_references|focus on how references are reused]] on wikis. We have some ideas about what needs to be fixed (unfortunately this project is currently on hold), but first we needed to measure the baseline situation, both to better understand how references are used and reused and to evaluate whether our potential intervention would be beneficial.
Previous [[metawikimedia:Research:Characterizing_Wikipedia_Citation_Usage|research into citations]] has also measured references by starting with the HTML-formatted articles, but HTML dumps weren't available at the time, so this was accomplished by downloading each article individually.
Those interested in the preliminary output of the scraper run can skip ahead to the [https://phabricator.wikimedia.org/T332032#9011167 raw summary results]. More detailed statistics for each wiki page and template will be published once we figure out longer-term hosting for the data.
=== Obstacles to finding references in wikitext ===
A raw reference is straightforward in wikitext and looks like: <code><nowiki><ref>This footnote.</ref></nowiki></code>. If this were the end of the story, it would be simple to parse references. What makes it more complicated is that many references are produced using reusable templates, for example: <code><nowiki>{{sfn|Hacker|Grimwood|2011|p=290}}</nowiki></code>.
If "<nowiki>{{sfn}}</nowiki>" were the only template used to produce references, then we could search for "<nowiki><ref></nowiki>" tags and "<nowiki>{{sfn}}</nowiki>" templates in wikitext. But a [https://en.wikipedia.org/w/index.php?search=insource%3A%3Cref%3E+insource%3A%2F%5B%3C%5Dref%5B%3E%5D%2F+-insource%3A%2Finclude+%5B%3C%5Dref%5B%3E%5D%2F&title=Special:Search&profile=advanced&fulltext=1&ns10=1 search for reference-producing templates] turns up over 12,000 different ref-producing templates on English Wikipedia alone, and every other wiki and language edition has its own, different set.
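To make the obstacle concrete, here is a minimal sketch of the naive wikitext approach described above. It is written in Python purely for illustration and is not taken from the scraper. The regular expressions, the function name <code>naive_ref_count</code>, and the template name <code>hypothetical-cite</code> are all assumptions invented for this example.

```python
import re

# Opening <ref> tags, with or without attributes, including the
# self-closing reuse form such as <ref name="x" />.
REF_TAG = re.compile(r"<ref(?:\s[^>]*)?>", re.IGNORECASE)

# Only the {{sfn}} template. Any other ref-producing template is invisible here.
SFN_TEMPLATE = re.compile(r"\{\{\s*sfn\s*[|}]", re.IGNORECASE)


def naive_ref_count(wikitext: str) -> int:
    """Count footnotes by looking only for <ref> tags and {{sfn}} templates."""
    return len(REF_TAG.findall(wikitext)) + len(SFN_TEMPLATE.findall(wikitext))


# "hypothetical-cite" is a made-up template name standing in for any of the
# thousands of ref-producing templates that this approach does not know about.
sample = (
    "Claim one.<ref>This footnote.</ref> "
    "Claim two.{{sfn|Hacker|Grimwood|2011|p=290}} "
    "Claim three.{{hypothetical-cite|Some source}}"
)

print(naive_ref_count(sample))
```

If the made-up template really did produce a footnote, the page would render three footnotes while this count would report only two. Closing that gap in wikitext would mean maintaining a separate list of templates for every wiki.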
=== References are simple in HTML ===