Draft:Elixir/bzip2-ex: Difference between revisions

Adamw (talk | contribs)
rewrite
Adamw (talk | contribs)
Found another library, bzip2_decomp
 
Line 1: Line 1:
A chronicle of my first Erlang/Elixir library binding (NIF).
An adventure story of my first Erlang/Elixir library binding (NIF).


''Adam Wight, Sept 2022''
''Adam Wight, Sept 2022''
Line 6: Line 6:


== Problem statement ==
== Problem statement ==
[[File:Phap Nang Ngam Nai Wannakhadi (1964, p 60).jpg|thumb|Phap Nang Ngam Nai Wannakhadi (1964, p 60).  [This painting is not titled, "Picking the low-hanging fruit". -AW]]]I wanted to process some large, compressed files containing Wikipedia content<ref>https://dumps.wikimedia.org/backup-index.html</ref>, which couldn't be expanded in place.  The typical approach to this problem is to stream the decompressed data through the desired analysis in memory and then throw it away.
[[File:Phap Nang Ngam Nai Wannakhadi (1964, p 60).jpg|thumb|Phap Nang Ngam Nai Wannakhadi (1964, p 60).  [This painting is not titled, "Picking the low-hanging fruit". -AW]]]I wanted to process some large, compressed files containing Wikipedia content<ref>https://dumps.wikimedia.org/backup-index.html</ref>, which couldn't be expanded in-place.  The typical approach to this problem is to stream the decompressed data through the desired analysis in memory and then throw it away.


Decompression can be accomplished by piping through an external, command-line tool or by reading the file using a native Elixir codec.  In my case, I chose to mix these approaches by untarring through tar using a Port, but use a native bzip2 library to perform the decompression.
Decompression can be accomplished by piping through an external, command-line tool or by reading the file using a native Elixir codec.  In my case, I chose to mix these approaches by untarring through tar using a Port, but writing a native bzip2 library to perform the decompression, since none existed at the time.


In hindsight, it would have been much simpler to use command-line bunzip2.  The native library should make it possible to use backpressure and concurrency.  But mostly I just got excited about a small gap in the BEAM ecosystem and wanted to teach myself how to write an Erlang native implemented function, or NIF<ref>https://www.erlang.org/doc/apps/erts/erl_nif</ref>.
In hindsight, it would have been much simpler to use command-line bunzip2.  The native library should make it possible to use backpressure and concurrency.  But mostly I just got excited about a small gap in the BEAM ecosystem and wanted to teach myself how to write an Erlang native implemented function, or NIF<ref>https://www.erlang.org/doc/apps/erts/erl_nif</ref>.
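
The "stream it through the analysis and throw it away" approach can be sketched in a few lines.  This is only an illustration in Python rather than Elixir, chosen because Python's standard library ships an incremental bzip2 decoder; it is not the binding described on this page.  The names <code>stream_decompressed</code> and <code>count_lines</code> and the 64 KiB chunk size are made up for the example.

```python
import bz2
import io


def stream_decompressed(fileobj, chunk_size=64 * 1024):
    """Yield decompressed pieces of a bzip2 file without holding it all in memory.

    Also copes with several bzip2 streams concatenated back to back.
    """
    decompressor = bz2.BZ2Decompressor()
    while True:
        chunk = fileobj.read(chunk_size)
        if not chunk:
            break
        while chunk:
            data = decompressor.decompress(chunk)
            if data:
                yield data
            if decompressor.eof:
                # One stream ended inside this chunk: carry the leftover
                # bytes over to a fresh decoder for the next stream.
                chunk = decompressor.unused_data
                decompressor = bz2.BZ2Decompressor()
            else:
                chunk = b""


def count_lines(fileobj):
    """Example analysis: count newlines, discarding each piece after use."""
    return sum(piece.count(b"\n") for piece in stream_decompressed(fileobj))


if __name__ == "__main__":
    sample = bz2.compress(b"one\ntwo\nthree\n")
    print(count_lines(io.BytesIO(sample)))
```

Memory use stays bounded by the chunk size plus whatever the decoder buffers internally, which is the property that matters when the expanded file would not fit on disk.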
Line 24: Line 24:


Here I learned the most important requirement of a NIF binding: it does its work within the BEAM memory and process space, but it must return control to the BEAM scheduler within a very short time period, about 1 millisecond as a rule of thumb.  Low-level it is, then!
Here I learned the most important requirement of a NIF binding: it does its work within the BEAM memory and process space, but it must return control to the BEAM scheduler within a very short time period, about 1 millisecond as a rule of thumb.  Low-level it is, then!
If you want to look into yet another approach, Moosieus<ref>https://github.com/Moosieus/bzip2_decomp</ref> has written an Elixir binding for pure Rust bzip2-rs<ref>https://github.com/paolobarbolini/bzip2-rs</ref>.  This looks good for decompression, but executes in a single run rather than streaming.
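
To make the difference between "a single run" and a scheduler-friendly design more concrete, here is a rough sketch of decompressing in bounded slices, so that no single call produces more than a fixed amount of output before returning to its caller.  It is again Python rather than C or Elixir, and only an analogy for how a NIF call might be kept short; <code>decompress_in_slices</code> and the 4096-byte limit are hypothetical choices for the example.

```python
import bz2


def decompress_in_slices(compressed, max_output=4096):
    """Yield decompressed data in pieces of at most max_output bytes.

    Each decompress() call does a bounded amount of work and then returns,
    which is the shape a long-running native call has to take.
    """
    decompressor = bz2.BZ2Decompressor()
    pending = compressed
    while not decompressor.eof:
        piece = decompressor.decompress(pending, max_length=max_output)
        # The decoder keeps unconsumed input internally, so later calls
        # pass an empty bytes object to drain the remaining output.
        pending = b""
        if piece:
            yield piece
        elif decompressor.needs_input:
            # No output and the decoder wants more input: data is truncated.
            break


if __name__ == "__main__":
    blob = bz2.compress(b"x" * 100_000)
    sizes = [len(piece) for piece in decompress_in_slices(blob)]
    print(len(sizes), max(sizes))
```

Between slices the caller is free to do other work, which is roughly what yielding back to the scheduler buys a NIF.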


==Native implemented function (NIF)==
==Native implemented function (NIF)==