Draft:Elixir/bzip2-ex: Difference between revisions

← Older edit

VisualWikitext

Latest revision as of 12:48, 14 February 2025

An adventure story of my first Erlang/Elixir library binding (NIF).

Adam Wight, Sept 2022

Project link:
https://gitlab.com/adamwight/bzip2-ex

Problem statement

I wanted to process some large, compressed files containing Wikipedia content^[1], which couldn't be expanded in-place. The typical approach to this problem is to stream the decompressed data through the desired analysis in memory and then throw it away.

Decompression can be accomplished by piping through an external, command-line tool or by reading the file using a native Elixir codec. In my case, I chose to mix these approaches by untarring through tar using a Port, but writing a native bzip2 library to perform the decompression, since none existed at the time.

In hindsight, it would have been much simpler to use command-line bunzip2. The native library should make it possible to use backpressure and concurrency. But mostly I just got excited about a small gap in the BEAM ecosystem and wanted to teach myself how to write an Erlang native implemented function, or NIF^[2].

How hard could it be to write a little binding...

Mysteries of libbzip2

The first interesting obstacle was that development of official bzip2 has stopped at the last stable release, with v1.0.x in 2019.^[3] A new group of people has been working towards a fork^[4] that they're calling version "1.1" but hopefully will avoid breaking changes to the programming interface. This is still unreleased as of 2025.

The second point worth mentioning is that the bzip2 file format has no formal specification. This situation is pretty common and I can't complain, because there's a brilliant reverse-engineering^[5] effort which included the details I needed.

High- or low-level integration?

As I mentioned, my own project was already in hard mode due to the decision to write a NIF at all. But there was a second choice, between a the libbzip2 high-level interface^[6] which does everything for you: open the file and return its contents decompressed, or the low-level interface^[7] which works with block or even sub-block chunks of data.

Here I learned the most important requirement of a NIF binding: it does work within the BEAM memory and process space but it must return control to the Elixir scheduler within a very short time period, less than 100ms or so. Low-level it is, then!

If you want to look into yet another approach, Moosieus^[8] has written an Elixir binding for pure Rust bzip2-rs^[9]. This looks good for decompression, but executes in a single run rather than streaming.

Native implemented function (NIF)

TODO...

Parallel processing

TODO: Stream vs block, what can write multiple streams, what are the challenges of detecting blocks...

References

[1] ttps://dumps.wikimedia.org/backup-index.html

[2] ttps://www.erlang.org/doc/apps/erts/erl_nif

[3] The project page for bzip2 v1.0 is https://sourceware.org/bzip2/.

[4] ttps://gitlab.com/bzip2/bzip2/

[5] ttps://github.com/dsnet/compress/blob/master/doc/bzip2-format.pdf

[6] ttps://sourceware.org/bzip2/manual/manual.html#hl-interface

[7] ttps://sourceware.org/bzip2/manual/manual.html#low-level

[8] ttps://github.com/Moosieus/bzip2_decomp

[9] ttps://github.com/paolobarbolini/bzip2-rs

[1]

[2]

[3]

[4]

[5]

[6]

[7]

[8]

[9]

@@ Line 1: / Line 1: @@
-A chronicle of my first Erlang/Elixir library binding (NIF).
+An adventure story of my first Erlang/Elixir library binding (NIF).
 ''Adam Wight, Sept 2022''
-== Background ==
+{{Project|url=https://gitlab.com/adamwight/bzip2-ex}}
-One common way to analyze Wikipedia content is to process its [https://dumps.wikimedia.org/backup-index.html database backup dumps], which are provided XML compressed using [[W:bzip2|bzip2]].  The unpacked files are too large to manipulate locally, and even the compressed files can be unwieldly, so a common practice is to stream the content, decompress in memory, and analyze in a single pass.  The output might be to extract a smaller subset of the data.
-Of course, the normal approach would be to search for a proven library in a solid data-sciencey language such as Python, pip install [https://pythonhosted.org/mwxml/ mwxml] and go on with the day.  But that wouldn't be as interesting as trying to do exactly the same thing in an esoteric young language, in service of a [https://gitlab.com/adamwight/mediawiki_client_ex pet project] with zero adoption.
+== Problem statement ==
-[[File:Phap Nang Ngam Nai Wannakhadi (1964, p 60).jpg|thumb|Phap Nang Ngam Nai Wannakhadi (1964, p 60)]]
+[[File:Phap Nang Ngam Nai Wannakhadi (1964, p 60).jpg|thumb|Phap Nang Ngam Nai Wannakhadi (1964, p 60).  [This painting is not titled, "Picking the low-hanging fruit". -AW]]I wanted to process some large, compressed files containing Wikipedia content<ref>https://dumps.wikimedia.org/backup-index.html</ref>, which couldn't be expanded in-place.  The typical approach to this problem is to stream the decompressed data through the desired analysis in memory and then throw it away.
-But imagine how much better a truly concurrent processor could be!
+Decompression can be accomplished by piping through an external, command-line tool or by reading the file using a native Elixir codec.  In my case, I chose to mix these approaches by untarring through tar using a Port, but writing a native bzip2 library to perform the decompression, since none existed at the time.
+In hindsight, it would have been much simpler to use command-line bunzip2.  The native library should make it possible to use backpressure and concurrency.  But mostly I just got excited about a small gap in the BEAM ecosystem and wanted to teach myself how to write an Erlang native implemented function, or NIF<ref>https://www.erlang.org/doc/apps/erts/erl_nif</ref>.
+How hard could it be to write a little binding...
+==Mysteries of libbzip2==
+The first interesting obstacle was that development of official bzip2 has stopped at the last stable release, with v1.0.x in 2019.<ref>The project page for bzip2 v1.0 is https://sourceware.org/bzip2/.</ref>  A new group of people has been working towards a fork<ref>https://gitlab.com/bzip2/bzip2/</ref> that they're calling version "1.1" but hopefully will avoid breaking changes to the programming interface.  This is still unreleased as of 2025.
+The second point worth mentioning is that the bzip2 file format has no formal specification.  This situation is pretty common and I can't complain, because there's a brilliant reverse-engineering<ref>https://github.com/dsnet/compress/blob/master/doc/bzip2-format.pdf</ref> effort which included the details I needed.
+==High- or low-level integration?==
+As I mentioned, my own project was already in hard mode due to the decision to write a NIF at all.  But there was a second choice, between a the libbzip2 high-level interface<ref>https://sourceware.org/bzip2/manual/manual.html#hl-interface</ref> which does everything for you: open the file and return its contents decompressed, or the low-level interface<ref>https://sourceware.org/bzip2/manual/manual.html#low-level</ref> which works with block or even sub-block chunks of data.
+Here I learned the most important requirement of a NIF binding: it does work within the BEAM memory and process space but it must return control to the Elixir scheduler within a very short time period, less than 100ms or so.  Low-level it is, then!
+If you want to look into yet another approach, Moosieus<ref>https://github.com/Moosieus/bzip2_decomp</ref> has written an Elixir binding for pure Rust bzip2-rs<ref>https://github.com/paolobarbolini/bzip2-rs</ref>.  This looks good for decompression, but executes in a single run rather than streaming.
-== Problem statement ==
+==Native implemented function (NIF)==
-Unfortunately, there's no BEAM (Elixir- and Erlang-compatible) library for reading bzip2 files, so the options would be to fetch and run the data through an external, bidirectional pipe, or write the bindings.
+[[File:Potato_sprout,_January_23,_2006.jpg|right|267x267px]]TODO...
+== Parallel processing ==
+TODO: Stream vs block, what can write multiple streams, what are the challenges of detecting blocks...
-How hard could it be to write a binding...
+==References==
-== The bzip2 file format has no specification ==
+<references />