Draft:Elixir/bzip2-ex: Difference between revisions
provide background |
Found another library, bzip2_decomp |
||
| (12 intermediate revisions by the same user not shown) | |||
| Line 1: | Line 1: | ||
An adventure story of my first Erlang/Elixir library binding (NIF). | |||
''Adam Wight, Sept 2022'' | |||
{{Project|url=https://gitlab.com/adamwight/bzip2-ex}} | |||
== Problem statement == | == Problem statement == | ||
[[File:Phap Nang Ngam Nai Wannakhadi (1964, p 60).jpg|thumb|Phap Nang Ngam Nai Wannakhadi (1964, p 60). [This painting is not titled, "Picking the low-hanging fruit". -AW]]I wanted to process some large, compressed files containing Wikipedia content<ref>https://dumps.wikimedia.org/backup-index.html</ref>, which couldn't be expanded in-place. The typical approach to this problem is to stream the decompressed data through the desired analysis in memory and then throw it away. | |||
Decompression can be accomplished by piping through an external, command-line tool or by reading the file using a native Elixir codec. In my case, I chose to mix these approaches by untarring through tar using a Port, but writing a native bzip2 library to perform the decompression, since none existed at the time. | |||
== The bzip2 file format has no specification == | In hindsight, it would have been much simpler to use command-line bunzip2. The native library should make it possible to use backpressure and concurrency. But mostly I just got excited about a small gap in the BEAM ecosystem and wanted to teach myself how to write an Erlang native implemented function, or NIF<ref>https://www.erlang.org/doc/apps/erts/erl_nif</ref>. | ||
How hard could it be to write a little binding... | |||
==Mysteries of libbzip2== | |||
The first interesting obstacle was that development of official bzip2 has stopped at the last stable release, with v1.0.x in 2019.<ref>The project page for bzip2 v1.0 is https://sourceware.org/bzip2/.</ref> A new group of people has been working towards a fork<ref>https://gitlab.com/bzip2/bzip2/</ref> that they're calling version "1.1" but hopefully will avoid breaking changes to the programming interface. This is still unreleased as of 2025. | |||
The second point worth mentioning is that the bzip2 file format has no formal specification. This situation is pretty common and I can't complain, because there's a brilliant reverse-engineering<ref>https://github.com/dsnet/compress/blob/master/doc/bzip2-format.pdf</ref> effort which included the details I needed. | |||
==High- or low-level integration?== | |||
As I mentioned, my own project was already in hard mode due to the decision to write a NIF at all. But there was a second choice, between a the libbzip2 high-level interface<ref>https://sourceware.org/bzip2/manual/manual.html#hl-interface</ref> which does everything for you: open the file and return its contents decompressed, or the low-level interface<ref>https://sourceware.org/bzip2/manual/manual.html#low-level</ref> which works with block or even sub-block chunks of data. | |||
Here I learned the most important requirement of a NIF binding: it does work within the BEAM memory and process space but it must return control to the Elixir scheduler within a very short time period, less than 100ms or so. Low-level it is, then! | |||
If you want to look into yet another approach, Moosieus<ref>https://github.com/Moosieus/bzip2_decomp</ref> has written an Elixir binding for pure Rust bzip2-rs<ref>https://github.com/paolobarbolini/bzip2-rs</ref>. This looks good for decompression, but executes in a single run rather than streaming. | |||
==Native implemented function (NIF)== | |||
[[File:Potato_sprout,_January_23,_2006.jpg|right|267x267px]]TODO... | |||
== Parallel processing == | |||
TODO: Stream vs block, what can write multiple streams, what are the challenges of detecting blocks... | |||
==References== | |||
<references /> | |||