Draft:Elixir/bzip2-ex

[[File:Phap Nang Ngam Nai Wannakhadi (1964, p 60).jpg|thumb|Phap Nang Ngam Nai Wannakhadi (1964, p 60)]]


One common way to analyze Wikipedia content is to process its database backup dumps<ref>https://dumps.wikimedia.org/backup-index.html</ref>, provided as [[W:bzip2|bzip2]]-compressed XML.  The unpacked files are too large to manipulate locally and unwieldy even when compressed, so a common practice is to stream the content, decompress in memory, and analyze in a single pass, perhaps extracting a smaller subset of the data and saving it to disk.
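The streaming pattern described above can be sketched in Python with only the standard library: read the compressed file in fixed-size chunks, feed each chunk through an incremental decompressor, and hand the decompressed pieces to whatever analysis step follows, so the full dump never has to fit in memory. (This is a minimal illustration, not the code of any particular library; the function name is made up for the example.)

```python
import bz2
import io

def stream_decompress(fileobj, chunk_size=64 * 1024):
    """Yield decompressed chunks from a bzip2 stream, reading the
    compressed input chunk_size bytes at a time."""
    decompressor = bz2.BZ2Decompressor()
    while chunk := fileobj.read(chunk_size):
        data = decompressor.decompress(chunk)
        if data:
            yield data

# Example: a tiny in-memory stand-in for a multi-gigabyte dump file.
sample = b"<page><title>Example</title></page>\n" * 1000
compressed = io.BytesIO(bz2.compress(sample))
total = sum(len(part) for part in stream_decompress(compressed))
print(total)  # equals len(sample)
```

A real single-pass analysis would replace the `sum(...)` with an XML pull parser consuming the yielded chunks.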


Of course, one would normally search for a proven library written in a solid data-sciencey language such as Python, run <code>pip install mwxml</code><ref>https://pythonhosted.org/mwxml/</ref> and go on with the day.

That wouldn't be as interesting as trying to do exactly the same thing in an esoteric young language, in service of a pet project<ref>https://gitlab.com/adamwight/mediawiki_client_ex</ref> with zero adoption.


<nowiki>"But imagine how much better a truly concurrent wiki dump processor could be!" —~~~</nowiki>