Draft:Elixir/bzip2-ex

From ludd
Revision as of 20:40, 7 September 2022 by Adamw (talk | contribs) (provide background)

A chronicle of my first Erlang/Elixir library binding (NIF).

Background

One common way to analyze Wikipedia content is to process its database backup dumps, which are provided as XML compressed with bzip2. The unpacked files are too large to manipulate locally, and even the compressed files can be unwieldy, so a common practice is to stream the content, decompress it in memory, and analyze it in a single pass. The output is often a smaller, extracted subset of the data.
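To make the streaming pattern concrete, here is a minimal sketch in Python, which is used only for illustration because its standard library already includes bzip2 support. The function names, the chunk size, and the "count the <page> tags" analysis are assumptions made for this example, not part of any real dump-processing tool. It also assumes a single-stream bzip2 file.

```python
import bz2
import io


def iter_decompressed(stream, chunk_size=64 * 1024):
    """Yield decompressed byte chunks from a bzip2-compressed binary stream.

    Only one compressed chunk and its decompressed output are held in
    memory at a time. Assumes a single bzip2 stream (hypothetical helper).
    """
    decompressor = bz2.BZ2Decompressor()
    while not decompressor.eof:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        data = decompressor.decompress(chunk)
        if data:
            yield data


def count_marker(stream, marker=b"<page>", chunk_size=64 * 1024):
    """Count occurrences of `marker` in the decompressed data in one pass."""
    count = 0
    tail = b""  # last len(marker) - 1 bytes, to catch markers split across chunks
    for data in iter_decompressed(stream, chunk_size):
        buf = tail + data
        count += buf.count(marker)
        keep = len(marker) - 1
        tail = buf[-keep:] if keep else b""
    return count


if __name__ == "__main__":
    # Stand-in for a dump file: a tiny XML document compressed in memory.
    xml = b"<mediawiki>" + b"<page><title>A</title></page>" * 3 + b"</mediawiki>"
    compressed = io.BytesIO(bz2.compress(xml))
    print(count_marker(compressed, chunk_size=16))
```

In practice the `stream` argument would be an open file or an HTTP response body rather than an in-memory buffer, and the per-chunk analysis would feed an incremental XML parser instead of counting a byte string.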

Of course, the normal approach would be to search for a proven library in a solid data-sciencey language such as Python, run pip install mwxml, and go on with the day. But that wouldn't be as interesting as trying to do exactly the same thing in an esoteric young language, in service of a pet project with zero adoption.

Problem statement

Unfortunately, there's no BEAM (Elixir- and Erlang-compatible) library for reading bzip2 files, so the options are either to fetch the data and run it through an external decompressor over a bidirectional pipe, or to write the bindings myself.

How hard could it be to write a binding...

The bzip2 file format has no specification