Draft:Elixir/bzip2-ex
[[File:Phap Nang Ngam Nai Wannakhadi (1964, p 60).jpg|thumb|Phap Nang Ngam Nai Wannakhadi (1964, p 60)]]
One common way to analyze Wikipedia content is to process its database backup dumps<ref>https://dumps.wikimedia.org/backup-index.html</ref>, provided as [[W:bzip2|bzip2]]-compressed XML. The unpacked files are too large to manipulate locally and unwieldy even when compressed, so a common practice is to stream the content, decompress it in memory, and analyze it in a single pass, typically extracting a smaller subset of the data and saving it to disk; a sketch of that pipeline appears below.
Of course, one would normally search for a proven library written in a solid data-sciencey language such as Python, run <code>pip install mwxml</code><ref>https://pythonhosted.org/mwxml/</ref>, and go on with one's day.
That wouldn't be as interesting as trying to do exactly the same thing in an esoteric young language, in service of a pet project<ref>https://gitlab.com/adamwight/mediawiki_client_ex</ref> with zero adoption.
<nowiki>"But imagine how much better a truly concurrent wiki dump processor could be!" —~~~</nowiki> | <nowiki>"But imagine how much better a truly concurrent wiki dump processor could be!" —~~~</nowiki> | ||