Elixir/Ports and external process wiring

Adamw (talk | contribs)
==== Challenge: controlling "rsync" ====
This exploration began as I wrote a simple library to run rsync from Elixir.<ref>https://hexdocs.pm/rsync/Rsync.html</ref>  I was hoping to learn how to interface with long-lived external processes, in this case to transfer files and monitor progress.  Starting and reading from rsync went very well, thanks to the <code>--info=progress2</code> option which reports progress in a fairly machine-readable format.  I was able to start the file transfer, capture status, and report it back to the Elixir caller in various ways.
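To illustrate what "fairly machine-readable" can mean in practice, here is a minimal sketch of parsing one progress line.  It assumes a line shaped like <code>1,234,567  45%  1.23MB/s  0:00:12</code> (bytes transferred, percentage, rate, time); the module name <code>ProgressSketch</code> and the returned map keys are hypothetical and not taken from the library:<syntaxhighlight lang="elixir">
defmodule ProgressSketch do
  # Parse a single progress line into a map, or return :error if the
  # line does not look like a progress report.
  # Assumed shape: "<bytes> <percent>% <rate> <h:mm:ss>", possibly
  # followed by more text which is ignored here.
  def parse(line) do
    pattern = ~r/^\s*([\d,]+)\s+(\d+)%\s+(\S+)\s+(\d+:\d{2}:\d{2})/

    case Regex.run(pattern, line) do
      [_full, bytes, percent, rate, time] ->
        {:ok,
         %{
           # Strip the thousands separators before converting.
           bytes: bytes |> String.replace(",", "") |> String.to_integer(),
           percent: String.to_integer(percent),
           rate: rate,
           time: time
         }}

      nil ->
        :error
    end
  end
end
</syntaxhighlight>
Lines that are not progress reports, such as file names or summary text, fall through to <code>:error</code> so the caller can ignore or log them.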
My library starts rsync using a low-level <code>Port</code> call, which maps directly to the base Erlang open_port<ref>https://www.erlang.org/doc/apps/erts/erlang.html#open_port/2</ref> implementation:<syntaxhighlight lang="elixir">
Port.open(
  {:spawn_executable, rsync_path},
  [
    :binary,
    :exit_status,
    :hide,
    :use_stdio,
    :stderr_to_stdout,
    args:
      ~w(-a --info=progress2) ++
        rsync_args ++
        sources ++
        [args[:target]],
    env: env
  ]
)
</syntaxhighlight>
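With the <code>:binary</code> and <code>:exit_status</code> options, the process that opened the port receives <code>{port, {:data, chunk}}</code> messages for output and a final <code>{port, {:exit_status, status}}</code> message.  The following is a minimal sketch of that message loop, not the library's actual code: it runs a short shell command instead of rsync so that it can be tried anywhere, and the module name <code>PortSketch</code> is hypothetical:<syntaxhighlight lang="elixir">
defmodule PortSketch do
  # Run a shell command through a Port and return {output, exit_status}.
  def run(command) do
    port =
      Port.open(
        {:spawn_executable, System.find_executable("sh")},
        [:binary, :exit_status, :stderr_to_stdout, args: ["-c", command]]
      )

    collect(port, "")
  end

  # Accumulate output chunks until the exit status arrives.
  defp collect(port, acc) do
    receive do
      {^port, {:data, chunk}} -> collect(port, acc <> chunk)
      {^port, {:exit_status, status}} -> {acc, status}
    end
  end
end
</syntaxhighlight>
Calling <code>PortSketch.run("echo hello; exit 3")</code> should return the collected output together with the status <code>3</code>.  A long-running transfer would handle each chunk as it arrives rather than accumulating everything until exit.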
==== Problem: runaway processes ====
Since I was calling my rsync library from an application under development, I would often kill the program abruptly, either by crashing it or by typing Ctrl-C in the terminal.  I found that the rsync transfer would continue to run in the background even after Elixir had completely shut down.
That would have to change: leaving overlapping file transfers running unmonitored is exactly what I wanted to avoid by having Elixir control the process in the first place.
==== Bad assumption: pipe-like processes ====
A common use case is to use external processes for something like compression and decompression.  A program like <code>gzip</code> or <code>cat</code> will stop once it detects that its input has ended, using a C system call like this:<syntaxhighlight lang="c">
ssize_t n_read = read(input_desc, buf, bufsize);
if (n_read < 0) { /* error: report it and exit */ }
if (n_read == 0) { /* end of file: finish up and exit */ }
</syntaxhighlight>The manual for read<ref>https://man.archlinux.org/man/read.2</ref> explains that reading 0 bytes indicates the end of file, and a negative number indicates an error such as the input file descriptor already being closed.
BEAM assumes that the connected process behaves like this, so it does nothing to clean up a dangling external process: the process is expected to end itself as soon as the Port is closed or the BEAM exits.  If the external process is known not to behave this way, the recommendation is to wrap it in a shell script which converts a closed stdin into a kill signal.<ref>https://hexdocs.pm/elixir/main/Port.html#module-orphan-operating-system-processes</ref>
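The difference between the two kinds of program can be observed directly.  The sketch below is an illustration under stated assumptions, not part of the library: it opens a port, closes it again, and then checks whether the operating system process still exists.  It assumes a Unix-like system where <code>sh</code>, <code>cat</code> and <code>sleep</code> are available; the module <code>OrphanDemo</code> and its functions are hypothetical names:<syntaxhighlight lang="elixir">
defmodule OrphanDemo do
  # True if an OS process with this pid still exists.  Uses the
  # shell's built-in `kill -0`, which tests for existence only.
  def os_alive?(os_pid) do
    {_out, status} =
      System.cmd("sh", ["-c", "kill -0 #{os_pid}"], stderr_to_stdout: true)

    status == 0
  end

  # Start a program, close its port, and report whether the OS
  # process is still running about one second later.
  def survives_close?(name, args) do
    port =
      Port.open(
        {:spawn_executable, System.find_executable(name)},
        [:binary, args: args]
      )

    {:os_pid, os_pid} = Port.info(port, :os_pid)
    Port.close(port)

    alive = still_alive_after?(os_pid, 20)

    # Clean up a survivor so the demo does not leave its own orphan.
    if alive do
      System.cmd("sh", ["-c", "kill -KILL #{os_pid}"], stderr_to_stdout: true)
    end

    alive
  end

  # Poll every 50 ms; give up and report "alive" after `tries` attempts.
  defp still_alive_after?(_os_pid, 0), do: true

  defp still_alive_after?(os_pid, tries) do
    if os_alive?(os_pid) do
      Process.sleep(50)
      still_alive_after?(os_pid, tries - 1)
    else
      false
    end
  end
end
</syntaxhighlight>
On a typical system, <code>OrphanDemo.survives_close?("cat", [])</code> is expected to be false because <code>cat</code> sees the end of its input and exits, while <code>OrphanDemo.survives_close?("sleep", ["60"])</code> is expected to be true because <code>sleep</code> never reads its input.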
==== BEAM internal and external processes ====
[[W:BEAM (Erlang virtual machine)|BEAM]] applications are built out of supervision trees and excel at managing huge numbers of parallel actor processes, all scheduled internally.  The communities mostly share a philosophy of running as much as possible inside the VM, because this builds on that strength and simplifies away much interface glue and context switching, but on many occasions an application will still start an external OS process.  There are some straightforward ways to simply run a command line, which might be familiar to programmers coming from another language: <code>[https://www.erlang.org/doc/apps/kernel/os.html#cmd/2 os:cmd]</code> takes a command string and runs it.  At a lower level, external programs are managed through a [https://www.erlang.org/doc/system/ports.html Port], a flexible abstraction which allows a backend driver to communicate data in and out, and to send some control signals such as reporting an external process's exit and exit status.
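As a brief illustration of the simple end of that spectrum, the following sketch calls <code>os:cmd</code> from Elixir.  It also shows Elixir's <code>System.cmd/3</code>, which is not mentioned above and is included here only for comparison; both calls assume a Unix-like system with <code>echo</code> available:<syntaxhighlight lang="elixir">
# :os.cmd/1 takes a charlist, runs it through the system shell, and
# returns the output as a charlist.  The exit status is not reported.
output = :os.cmd(~c"echo hello") |> List.to_string()

# System.cmd/3 takes the program and its arguments separately and
# returns the collected output along with the exit status.
{out, status} = System.cmd("echo", ["hello"])

IO.inspect({output, out, status})
</syntaxhighlight>
Both calls block until the command finishes, which is why a long-lived, monitored process such as a file transfer calls for the lower-level Port interface instead.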