Draft:Elixir/bzip2-ex: Difference between revisions
No edit summary Tags: Mobile edit Mobile web edit Visual edit |
Link to project source |
||
(2 intermediate revisions by the same user not shown) | |||
Line 2: | Line 2: | ||
''Adam Wight, Sept 2022'' | ''Adam Wight, Sept 2022'' | ||
{{Project|url=https://gitlab.com/adamwight/bzip2-ex}} | |||
== Background == | == Background == | ||
[[File:Phap Nang Ngam Nai Wannakhadi (1964, p 60).jpg|thumb|Phap Nang Ngam Nai Wannakhadi (1964, p 60)]] | [[File:Phap Nang Ngam Nai Wannakhadi (1964, p 60).jpg|thumb|Phap Nang Ngam Nai Wannakhadi (1964, p 60). [This painting is not titled, "Picking the low-hanging fruit". -AW]]One common way to analyze Wikipedia content is to mung its database backup dumps<ref>https://dumps.wikimedia.org/backup-index.html</ref>, which are provided as [[W:bzip2|bzip2]]-compressed XML. These files are too large to use unpacked and unwieldy even when compressed, so are best served streaming. | ||
One common way to analyze Wikipedia content is to | |||
== Problem statement== | == Problem statement== | ||
Unfortunately, there was no BEAM (Elixir- and Erlang-compatible) library for reading bzip2 files, so we could either fetch and decompress the data externally through an separate process, or write a | Unfortunately, there was no BEAM (Elixir- and Erlang-compatible) library for reading bzip2 files, so we could either fetch and decompress the data externally through an separate process, or write a glue library exposing libbzip2 functions in Elixir. | ||
How hard could it be to write a binding... | How hard could it be to write a little binding... | ||
==Is libbzip2 maintained?== | ==Is libbzip2 maintained?== | ||
Development of bzip2 has stopped at the last stable release, with v1.0.x in 2019.<ref>The project page for bzip2 v1.0 is https://sourceware.org/bzip2/.</ref> A new group has been working towards an initial fork<ref>https://gitlab.com/bzip2/bzip2/</ref> that they're calling version "1.1" but hopefully will avoid breaking changes to the programming interface. | |||
== The bzip2 file format has no specification == | |||
I can't complain, there's a brilliant reverse-engineering effort<ref>https://github.com/dsnet/compress/blob/master/doc/bzip2-format.pdf</ref>. | |||
<br clear="all"> | |||
==High- or low-level integration?== | |||
== High- or low-level integration? == | |||
We already chose the "hard landing" by writing a new library in the first place, but there's still an opportunity for making life sort of easy again, by choosing one of two APIs provided by libbzip2, either the high-level<ref>https://sourceware.org/bzip2/manual/manual.html#hl-interface</ref> interface which opens a file and returns the decompressed contents, or the low-level<ref>https://sourceware.org/bzip2/manual/manual.html#low-level</ref> interface for maximum interactivity, which is called with tiny bites of data, shares its memory with the caller, exposes an internal state machine, and offers your humble developer many forms of additional stimulation. | We already chose the "hard landing" by writing a new library in the first place, but there's still an opportunity for making life sort of easy again, by choosing one of two APIs provided by libbzip2, either the high-level<ref>https://sourceware.org/bzip2/manual/manual.html#hl-interface</ref> interface which opens a file and returns the decompressed contents, or the low-level<ref>https://sourceware.org/bzip2/manual/manual.html#low-level</ref> interface for maximum interactivity, which is called with tiny bites of data, shares its memory with the caller, exposes an internal state machine, and offers your humble developer many forms of additional stimulation. | ||
There were a few considerations here, and in theory the high-level interface would have made a perfectly acceptable binding. However, this would be a single call to decompress the world, and control flow wouldn't return to Elixir for several hours. This pretty much defeats the purpose of writing a binding library for a concurrent virtual machine. | There were a few considerations here, and in theory the high-level interface would have made a perfectly acceptable binding. However, this would be a single call to decompress the world, and control flow wouldn't return to Elixir for several hours. This pretty much defeats the purpose of writing a binding library for a concurrent virtual machine. | ||
== Native implemented function (NIF) == | ==Native implemented function (NIF)== | ||
[[File:Potato_sprout,_January_23,_2006.jpg|right|267x267px]] | |||
This is where I began to feel like I was learning stuff. | This is where I began to feel like I was learning stuff. | ||
Elixir owes its existence to Erlang and the virtual machine that runs them both, BEAM<ref>https://blog.stenmans.org/theBeamBook/#_the_erlang_virtual_machine_beam</ref>. Extending the machine is done by writing a NIF library<ref>https://www.erlang.org/doc/man/erl_nif.html</ref> which consists of | Elixir owes its existence to Erlang and the virtual machine that runs them both, BEAM<ref>https://blog.stenmans.org/theBeamBook/#_the_erlang_virtual_machine_beam</ref>. Extending the machine is done by writing a NIF library<ref>https://www.erlang.org/doc/man/erl_nif.html</ref> which consists of a C/C++ to adapt the library, and a small Erlang or Elixir stub to define an interface. | ||
TODO ... | |||
== Parallel processing == | |||
TODO: Stream vs block, what can write multiple streams, what are the challenges of detecting blocks... | |||
== References == | ==References== | ||
<references /> | <references /> |
Latest revision as of 19:41, 25 March 2023
A chronicle of my first Erlang/Elixir library binding (NIF).
Adam Wight, Sept 2022
Project link:
https://gitlab.com/adamwight/bzip2-ex
Background[edit | edit source]
One common way to analyze Wikipedia content is to mung its database backup dumps[1], which are provided as bzip2-compressed XML. These files are too large to use unpacked and unwieldy even when compressed, so are best served streaming.
Problem statement[edit | edit source]
Unfortunately, there was no BEAM (Elixir- and Erlang-compatible) library for reading bzip2 files, so we could either fetch and decompress the data externally through an separate process, or write a glue library exposing libbzip2 functions in Elixir.
How hard could it be to write a little binding...
Is libbzip2 maintained?[edit | edit source]
Development of bzip2 has stopped at the last stable release, with v1.0.x in 2019.[2] A new group has been working towards an initial fork[3] that they're calling version "1.1" but hopefully will avoid breaking changes to the programming interface.
The bzip2 file format has no specification[edit | edit source]
I can't complain, there's a brilliant reverse-engineering effort[4].
High- or low-level integration?[edit | edit source]
We already chose the "hard landing" by writing a new library in the first place, but there's still an opportunity for making life sort of easy again, by choosing one of two APIs provided by libbzip2, either the high-level[5] interface which opens a file and returns the decompressed contents, or the low-level[6] interface for maximum interactivity, which is called with tiny bites of data, shares its memory with the caller, exposes an internal state machine, and offers your humble developer many forms of additional stimulation.
There were a few considerations here, and in theory the high-level interface would have made a perfectly acceptable binding. However, this would be a single call to decompress the world, and control flow wouldn't return to Elixir for several hours. This pretty much defeats the purpose of writing a binding library for a concurrent virtual machine.
Native implemented function (NIF)[edit | edit source]
This is where I began to feel like I was learning stuff.
Elixir owes its existence to Erlang and the virtual machine that runs them both, BEAM[7]. Extending the machine is done by writing a NIF library[8] which consists of a C/C++ to adapt the library, and a small Erlang or Elixir stub to define an interface.
TODO ...
Parallel processing[edit | edit source]
TODO: Stream vs block, what can write multiple streams, what are the challenges of detecting blocks...
References[edit | edit source]
- ↑ https://dumps.wikimedia.org/backup-index.html
- ↑ The project page for bzip2 v1.0 is https://sourceware.org/bzip2/.
- ↑ https://gitlab.com/bzip2/bzip2/
- ↑ https://github.com/dsnet/compress/blob/master/doc/bzip2-format.pdf
- ↑ https://sourceware.org/bzip2/manual/manual.html#hl-interface
- ↑ https://sourceware.org/bzip2/manual/manual.html#low-level
- ↑ https://blog.stenmans.org/theBeamBook/#_the_erlang_virtual_machine_beam
- ↑ https://www.erlang.org/doc/man/erl_nif.html