Jump to content

Draft:Elixir/bzip2-ex: Difference between revisions

From ludd
Adamw (talk | contribs)
Link to project source
Adamw (talk | contribs)
rewrite
Line 5: Line 5:
{{Project|url=https://gitlab.com/adamwight/bzip2-ex}}
{{Project|url=https://gitlab.com/adamwight/bzip2-ex}}


== Background ==
== Problem statement ==
[[File:Phap Nang Ngam Nai Wannakhadi (1964, p 60).jpg|thumb|Phap Nang Ngam Nai Wannakhadi (1964, p 60).  [This painting is not titled, "Picking the low-hanging fruit". -AW]]One common way to analyze Wikipedia content is to mung its database backup dumps<ref>https://dumps.wikimedia.org/backup-index.html</ref>, which are provided as [[W:bzip2|bzip2]]-compressed XMLThese files are too large to use unpacked and unwieldy even when compressed, so are best served streaming.
[[File:Phap Nang Ngam Nai Wannakhadi (1964, p 60).jpg|thumb|Phap Nang Ngam Nai Wannakhadi (1964, p 60).  [This painting is not titled, "Picking the low-hanging fruit". -AW]]I wanted to process some large, compressed files containing Wikipedia content<ref>https://dumps.wikimedia.org/backup-index.html</ref>, which couldn't be expanded in placeThe typical approach to this problem is to stream the decompressed data through the desired analysis in memory and then throw it away.


== Problem statement==
Decompression can be accomplished by piping through an external, command-line tool or by reading the file using a native Elixir codec.  In my case, I chose to mix these approaches by untarring through tar using a Port, but use a native bzip2 library to perform the decompression.
Unfortunately, there was no BEAM (Elixir- and Erlang-compatible) library for reading bzip2 files, so we could either fetch and decompress the data externally through an separate process, or write a glue library exposing libbzip2 functions in Elixir.
 
In hindsight, it would have been much simpler to use command-line bunzip2.  The native library should make it possible to use backpressure and concurrency.  But mostly I just got excited about a small gap in the BEAM ecosystem and wanted to teach myself how to write an Erlang native implemented function, or NIF<ref>https://www.erlang.org/doc/apps/erts/erl_nif</ref>.


How hard could it be to write a little binding...
How hard could it be to write a little binding...


==Is libbzip2 maintained?==
==Mysteries of libbzip2==
 
Development of bzip2 has stopped at the last stable release, with v1.0.x in 2019.<ref>The project page for bzip2 v1.0 is https://sourceware.org/bzip2/.</ref>  A new group has been working towards an initial fork<ref>https://gitlab.com/bzip2/bzip2/</ref> that they're calling version "1.1" but hopefully will avoid breaking changes to the programming interface.


== The bzip2 file format has no specification ==
The first interesting obstacle was that development of official bzip2 has stopped at the last stable release, with v1.0.x in 2019.<ref>The project page for bzip2 v1.0 is https://sourceware.org/bzip2/.</ref>  A new group of people has been working towards a fork<ref>https://gitlab.com/bzip2/bzip2/</ref> that they're calling version "1.1" but hopefully will avoid breaking changes to the programming interface.  This is still unreleased as of 2025.
I can't complain, there's a brilliant reverse-engineering effort<ref>https://github.com/dsnet/compress/blob/master/doc/bzip2-format.pdf</ref>.


<br clear="all">
The second point worth mentioning is that the bzip2 file format has no formal specification.  This situation is pretty common and I can't complain, because there's a brilliant reverse-engineering<ref>https://github.com/dsnet/compress/blob/master/doc/bzip2-format.pdf</ref> effort which included the details I needed.


==High- or low-level integration?==
==High- or low-level integration?==
We already chose the "hard landing" by writing a new library in the first place, but there's still an opportunity for making life sort of easy again, by choosing one of two APIs provided by libbzip2, either the high-level<ref>https://sourceware.org/bzip2/manual/manual.html#hl-interface</ref> interface which opens a file and returns the decompressed contents, or the low-level<ref>https://sourceware.org/bzip2/manual/manual.html#low-level</ref> interface for maximum interactivity, which is called with tiny bites of data, shares its memory with the caller, exposes an internal state machine, and offers your humble developer many forms of additional stimulation.
As I mentioned, my own project was already in hard mode due to the decision to write a NIF at all.  But there was a second choice, between a the libbzip2 high-level interface<ref>https://sourceware.org/bzip2/manual/manual.html#hl-interface</ref> which does everything for you: open the file and return its contents decompressed, or the low-level interface<ref>https://sourceware.org/bzip2/manual/manual.html#low-level</ref> which works with block or even sub-block chunks of data.


There were a few considerations here, and in theory the high-level interface would have made a perfectly acceptable binding.  However, this would be a single call to decompress the world, and control flow wouldn't return to Elixir for several hoursThis pretty much defeats the purpose of writing a binding library for a concurrent virtual machine.
Here I learned the most important requirement of a NIF binding: it does work within the BEAM memory and process space but it must return control to the Elixir scheduler within a very short time period, less than 100ms or soLow-level it is, then!


==Native implemented function (NIF)==
==Native implemented function (NIF)==
[[File:Potato_sprout,_January_23,_2006.jpg|right|267x267px]]
[[File:Potato_sprout,_January_23,_2006.jpg|right|267x267px]]TODO...
This is where I began to feel like I was learning stuff.
 
Elixir owes its existence to Erlang and the virtual machine that runs them both, BEAM<ref>https://blog.stenmans.org/theBeamBook/#_the_erlang_virtual_machine_beam</ref>.  Extending the machine is done by writing a NIF library<ref>https://www.erlang.org/doc/man/erl_nif.html</ref> which consists of a C/C++ to adapt the library, and a small Erlang or Elixir stub to define an interface.
 
TODO ...


== Parallel processing ==
== Parallel processing ==

Revision as of 08:26, 14 February 2025

A chronicle of my first Erlang/Elixir library binding (NIF).

Adam Wight, Sept 2022



Problem statement

Phap Nang Ngam Nai Wannakhadi (1964, p 60). [This painting is not titled, "Picking the low-hanging fruit". -AW

I wanted to process some large, compressed files containing Wikipedia content[1], which couldn't be expanded in place. The typical approach to this problem is to stream the decompressed data through the desired analysis in memory and then throw it away.

Decompression can be accomplished by piping through an external, command-line tool or by reading the file using a native Elixir codec. In my case, I chose to mix these approaches by untarring through tar using a Port, but use a native bzip2 library to perform the decompression.

In hindsight, it would have been much simpler to use command-line bunzip2. The native library should make it possible to use backpressure and concurrency. But mostly I just got excited about a small gap in the BEAM ecosystem and wanted to teach myself how to write an Erlang native implemented function, or NIF[2].

How hard could it be to write a little binding...

Mysteries of libbzip2

The first interesting obstacle was that development of official bzip2 has stopped at the last stable release, with v1.0.x in 2019.[3] A new group of people has been working towards a fork[4] that they're calling version "1.1" but hopefully will avoid breaking changes to the programming interface. This is still unreleased as of 2025.

The second point worth mentioning is that the bzip2 file format has no formal specification. This situation is pretty common and I can't complain, because there's a brilliant reverse-engineering[5] effort which included the details I needed.

High- or low-level integration?

As I mentioned, my own project was already in hard mode due to the decision to write a NIF at all. But there was a second choice, between a the libbzip2 high-level interface[6] which does everything for you: open the file and return its contents decompressed, or the low-level interface[7] which works with block or even sub-block chunks of data.

Here I learned the most important requirement of a NIF binding: it does work within the BEAM memory and process space but it must return control to the Elixir scheduler within a very short time period, less than 100ms or so. Low-level it is, then!

Native implemented function (NIF)

TODO...

Parallel processing

TODO: Stream vs block, what can write multiple streams, what are the challenges of detecting blocks...

References