Elixir/HTML dump scraper
== Reference parsing ==

=== What are references? ===
References are the little footnotes all over Wikipedia articles:<ref group="example-footnote">This is a footnote body. Often you will see a citation to a book or other source here.</ref> These footnotes ground the writing in sources, and are a distinctive aspect of the Wikipedias' intellectual cultures.

<references group="example-footnote" />

=== Why are we counting references? ===
The Wikimedia Germany Technical Wishes team has spent the past few months [[metawikimedia:WMDE_Technical_Wishes/Reusing_references|focusing on how references are reused]] on wikis. We have some ideas about what needs to be fixed (unfortunately this project is currently on hold), but first we needed to measure the baseline situation, both to better understand how references are used and reused and to evaluate whether our potential intervention is beneficial.

Previous [[metawikimedia:Research:Characterizing_Wikipedia_Citation_Usage|research into citations]] has also measured references by starting with the HTML-formatted articles, but HTML dumps weren't available at the time, so this was accomplished by downloading each article individually.

For those interested in the preliminary output of the scraper run, please skip ahead to the [https://phabricator.wikimedia.org/T332032#9011167 raw summary results]. More detailed statistics for each wiki page and template will be published once we figure out longer-term hosting for the data.

=== Obstacles to finding references in wikitext ===
A raw reference is straightforward in wikitext and looks like: <code><nowiki><ref>This footnote.</ref></nowiki></code>. If this were the end of the story, it would be simple to parse references. What makes it more complicated is that many references are produced using reusable templates, for example: <code><nowiki>{{sfn|Hacker|Grimwood|2011|p=290}}</nowiki></code>.
If "<nowiki>{{sfn}}</nowiki>" were the only template used to produce references, then we could simply search for "<nowiki><ref></nowiki>" tags and "<nowiki>{{sfn}}</nowiki>" templates in wikitext. But a [https://en.wikipedia.org/w/index.php?search=insource%3A%3Cref%3E+insource%3A%2F%5B%3C%5Dref%5B%3E%5D%2F+-insource%3A%2Finclude+%5B%3C%5Dref%5B%3E%5D%2F&title=Special:Search&profile=advanced&fulltext=1&ns10=1 search for reference-producing templates] reveals over 12,000 different ref-producing templates on English Wikipedia alone, and every other wiki and language edition has its own, different set.

=== References are simple in HTML ===
Once the wikitext is fully rendered to HTML, we can finally see all of the footnotes which were produced. They look something like this: <code><nowiki><div typeof="mw:Extension/ref">Footnote text.</div></nowiki></code>

Since the rendering is complete, we know exactly which references are visible, which is better than the ''potential'' references that we might have been able to determine from a static analysis of each template.

Template expansion also maps to the hierarchical structure of the HTML, which makes it possible to tell when a reference was produced by a template and when a reference contains templates. Both of these cases are interesting to our research.
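
To make the idea concrete, here is a minimal sketch of counting references in rendered HTML. It is written in Python using only the standard library and is not the scraper's own code. The sample markup is a made-up, simplified fragment in the shape shown above, and the <code>mw:Transclusion</code> marker for template output is an assumption for illustration; real dump HTML may label templates differently. The names <code>RefCounter</code> and <code>count_refs</code> are hypothetical.

```python
from html.parser import HTMLParser

REF_TYPE = "mw:Extension/ref"       # marker shown in the text above
TEMPLATE_TYPE = "mw:Transclusion"   # ASSUMED marker for template output

# Elements that never get a closing tag, so they are not tracked as open.
VOID_TAGS = {"area", "br", "col", "hr", "img", "input", "link", "meta", "wbr"}


class RefCounter(HTMLParser):
    """Counts references and relates them to surrounding template markup.

    Assumes well-formed, properly nested HTML.
    """

    def __init__(self):
        super().__init__()
        self.stack = []                # one entry per currently open element
        self.total = 0                 # every reference seen
        self.produced_by_template = 0  # reference sits inside template output
        self.containing_template = 0   # reference body includes a template

    def handle_starttag(self, tag, attrs):
        types = set((dict(attrs).get("typeof") or "").split())
        is_ref = REF_TYPE in types
        is_tpl = TEMPLATE_TYPE in types
        if is_ref:
            self.total += 1
            if is_tpl or any(entry["tpl"] for entry in self.stack):
                self.produced_by_template += 1
        elif is_tpl:
            # Flag the innermost enclosing reference, at most once per reference.
            for entry in reversed(self.stack):
                if entry["ref"]:
                    if not entry["counted"]:
                        entry["counted"] = True
                        self.containing_template += 1
                    break
        if tag not in VOID_TAGS:
            self.stack.append({"ref": is_ref, "tpl": is_tpl, "counted": False})

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self.stack:
            self.stack.pop()


def count_refs(html):
    """Return reference counts for one rendered page."""
    parser = RefCounter()
    parser.feed(html)
    parser.close()
    return {
        "total": parser.total,
        "produced_by_template": parser.produced_by_template,
        "containing_template": parser.containing_template,
    }


# Made-up sample: a raw ref, a ref produced by a template,
# and a ref whose body contains a template.
SAMPLE = """
<div>Plain claim.<div typeof="mw:Extension/ref">Footnote text.</div></div>
<div>Short cite.<span typeof="mw:Transclusion"><div typeof="mw:Extension/ref">Hacker &amp; Grimwood 2011, p. 290.</div></span></div>
<div>Wrapped cite.<div typeof="mw:Extension/ref"><span typeof="mw:Transclusion">Book title, 2011.</span></div></div>
"""

print(count_refs(SAMPLE))
```

The point of the sketch is that no knowledge of individual templates is needed: a single marker identifies every visible footnote, and the nesting of elements is enough to tell the two template cases apart.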