Draft:Elixir/bzip2-ex

[[File:Phap Nang Ngam Nai Wannakhadi (1964, p 60).jpg|thumb|Phap Nang Ngam Nai Wannakhadi (1964, p 60)]]


One common way to analyze Wikipedia content is to process its database backup dumps<ref>https://dumps.wikimedia.org/backup-index.html</ref>, provided as [[W:bzip2|bzip2]]-compressed XML.  The unpacked files are too large to manipulate locally and unwieldy even when compressed, so a common practice is to stream the content, decompress in memory, and analyze in a single pass, perhaps extracting a smaller subset of the data and saving it to disk.
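The streaming pattern described above can be sketched in Python with only the standard library: read the compressed file in fixed-size chunks, feed each chunk through an incremental decompressor, and hand the decompressed pieces to whatever analysis step follows, so the full dump never has to fit in memory. (This is a minimal illustration, not the code of any particular library; the function name is made up for the example.)

```python
import bz2
import io

def stream_decompress(fileobj, chunk_size=64 * 1024):
    """Yield decompressed chunks from a bzip2 stream, reading the
    compressed input chunk_size bytes at a time."""
    decompressor = bz2.BZ2Decompressor()
    while chunk := fileobj.read(chunk_size):
        data = decompressor.decompress(chunk)
        if data:
            yield data

# Example: a tiny in-memory stand-in for a multi-gigabyte dump file.
sample = b"<page><title>Example</title></page>\n" * 1000
compressed = io.BytesIO(bz2.compress(sample))
total = sum(len(part) for part in stream_decompress(compressed))
print(total)  # equals len(sample)
```

A real single-pass analysis would replace the `sum(...)` with an XML pull parser consuming the yielded chunks.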


Of course, one would normally search for a proven library written in a solid data-sciencey language such as Python, run <code>pip install mwxml</code><ref>https://pythonhosted.org/mwxml/</ref> and go on with the day.

That wouldn't be as interesting as trying to do exactly the same thing in an esoteric young language, in service of a pet project<ref>https://gitlab.com/adamwight/mediawiki_client_ex</ref> with zero adoption.


<nowiki>"But imagine how much better a truly concurrent wiki dump processor could be!" —~~~</nowiki>