Elixir/Ports and external process wiring

From ludd
Adamw (talk | contribs)

Revision as of 10:52, 19 October 2025

A deceptively simple programming adventure veers unexpectedly into piping and signaling between unix processes.

Context: controlling "rsync"


My exploration begins while writing a beta-quality library for Elixir to transfer files in the background and monitor progress, using rsync.



I was excited to learn how to interface with long-lived external processes—and this project offered more than I hoped for.

A Toque macaque (Macaca radiata) Monkey eating peanuts. Pictured in Bangalore, India

Naive shelling

Starting rsync should be as easy as calling out to a shell:

System.shell("rsync -a source target")

This has a few shortcomings, starting with how we pass the filenames. It's possible to have a dynamic path coming from string interpolation like #{source} but this gets risky: consider what happens if the filenames include whitespace or even special shell characters such as ";".
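To make the risk concrete, here is a minimal sketch that uses a harmless echo in place of rsync. The file name is a made-up example, and the sketch assumes a Unix-like system where System.shell hands the whole string to sh.

```elixir
# Hypothetical, attacker-influenced file name (illustration only).
source = "photos; echo pwned"

# Interpolation produces a single string, which the shell then re-parses.
# The ";" ends the first command and starts a second one.
{output, status} = System.shell("echo copying #{source}")

IO.inspect({output, status})
```

The second command runs even though the caller only meant to pass a file name, which is the kind of surprise that a raw argv avoids.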

Safe path handling

Skipping ahead to System.cmd, which takes a raw argv and can't be fooled by special characters in the path arguments:

System.find_executable(rsync_path)
# argv must be a flat list of strings
|> System.cmd(["-a", source, target])

For a short job this would be fine, but during longer transfers our program loses control and we have to wait indefinitely for the monolithic command to finish.

Asynchronous call and communication

To run an external process asynchronously we will reach for Elixir's low-level Port.open, which passes all of its parameters directly[1] to ERTS open_port[2]. These functions are tremendously flexible; here we turn a few knobs:

Port.open(
  {:spawn_executable, rsync_path},
  [
    :binary,
    :exit_status,
    :hide,
    :use_stdio,
    :stderr_to_stdout,
    args:
      ~w(-a --info=progress2) ++
        rsync_args ++
        sources ++
        [args[:target]],
    env: env
  ]
)



Rsync outputs progress lines in a fairly self-explanatory format:

      3,342,336  33%    3.14MB/s    0:00:02

Our Port captures output and each line is sent to the library's handle_info callback as {:data, line}. After the transfer is finished we receive a conclusive {:exit_status, status_code} message.

As a first step, we extract the percent_done column and log any unrecognized output:

with terms when terms != [] <- String.split(line, ~r"\s", trim: true),
     percent_done_text when is_binary(percent_done_text) <- Enum.at(terms, 1),
     {percent_done, "%"} <- Float.parse(percent_done_text) do
  percent_done
else
  _ ->
    {:unknown, line}
end
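For experimenting outside the library, the same expression can be wrapped in a function. This is only a sketch: the module name ProgressSketch and the sample lines are made up for illustration, and the body is the with expression shown above.

```elixir
defmodule ProgressSketch do
  # Hypothetical wrapper around the `with` expression from the text.
  # Returns the percentage as a float, or {:unknown, line} for anything else.
  def parse(line) do
    with terms when terms != [] <- String.split(line, ~r"\s", trim: true),
         percent_done_text when is_binary(percent_done_text) <- Enum.at(terms, 1),
         {percent_done, "%"} <- Float.parse(percent_done_text) do
      percent_done
    else
      _ ->
        {:unknown, line}
    end
  end
end

# A progress line, with and without surrounding control characters.
IO.inspect(ProgressSketch.parse("      3,342,336  33%    3.14MB/s    0:00:02"))
IO.inspect(ProgressSketch.parse("\r      3,342,336  33%    3.14MB/s    0:00:02\n"))

# Anything that is not a progress line is passed back for logging.
IO.inspect(ProgressSketch.parse("sending incremental file list"))
```

Returning the tagged tuple instead of raising keeps the caller in charge of what to do with unexpected output.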

The trim option is doing more than its share of the work here: it lets us completely ignore spacing and newline trickery, and even a leading carriage return that we can see in the rsync source code:

rprintf(FCLIENT, "\r%15s %3d%% %7.2f%s %s%s", ...);

Carriage return \r deserves a special mention: this "control" character is just a byte in the binary data coming over the pipe from rsync, and it only acts as a control because of how a terminal emulator responds to it. On a terminal the effect is to rewind the cursor to the start of the line, so that the next output overwrites the current line!
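Because \r is ordinary data on the pipe, nothing guarantees that one message from the Port holds exactly one progress update. A minimal sketch, assuming a chunk may contain several updates separated by carriage returns (the sample chunk is made up):

```elixir
# Made-up chunk holding two progress updates, each introduced by "\r".
chunk =
  "\r      1,114,112  11%    1.06MB/s    0:00:08" <>
    "\r      3,342,336  33%    3.14MB/s    0:00:02"

# "\r" is a single byte with value 13 (0x0d).
<<13>> = "\r"

# Split on the carriage returns and keep only the most recent update.
updates = String.split(chunk, "\r", trim: true)
latest = List.last(updates)

IO.inspect(latest)
```

Keeping only the last update mirrors what the terminal does visually: each carriage return discards whatever was on the line before it.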

A repeated theme in inter-process communication is that data and control are leaky categories. We come to the more formal control side channels later.



OTP generic server

This is where Erlang/OTP really starts to shine: our rsync library wraps the Port calls in a gen_server[4] module, which gives us some special properties for free. It runs as a dedicated BEAM process (a lightweight process, not an OS thread) that coordinates with rsync independently from anything else, receiving and sending asynchronous messages. Its internal state includes the latest percent done, which calling code can query, or the server can be set up to push updates to a listener.
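The shape of such a wrapper can be sketched in a few dozen lines. This is not the library's actual code: the module name, the function names and the state layout are all hypothetical, and the percentage is pulled out with a simple regular expression rather than the parser shown earlier.

```elixir
defmodule RsyncSketch.Server do
  use GenServer

  # Client API (hypothetical names).
  def start_link(command, args) do
    GenServer.start_link(__MODULE__, {command, args})
  end

  def percent_done(server), do: GenServer.call(server, :percent_done)

  @impl true
  def init({command, args}) do
    # `command` must be a full path to an executable.
    port =
      Port.open({:spawn_executable, command}, [
        :binary,
        :exit_status,
        :stderr_to_stdout,
        args: args
      ])

    {:ok, %{port: port, percent_done: 0, exit_status: nil}}
  end

  @impl true
  def handle_call(:percent_done, _from, state) do
    {:reply, state.percent_done, state}
  end

  @impl true
  def handle_info({port, {:data, data}}, %{port: port} = state) do
    # Keep the last percentage found in this chunk, if there is one.
    case Regex.scan(~r/(\d+)%/, data) do
      [] ->
        {:noreply, state}

      matches ->
        [_full, digits] = List.last(matches)
        {:noreply, %{state | percent_done: String.to_integer(digits)}}
    end
  end

  def handle_info({port, {:exit_status, status}}, %{port: port} = state) do
    {:noreply, %{state | exit_status: status}}
  end

  # Ignore anything that did not come from our port.
  def handle_info(_other, state), do: {:noreply, state}
end
```

Matching the port in both the message and the state means that messages from any other port fall through to the last clause. Calling code would poll with RsyncSketch.Server.percent_done(pid), and pushing updates to a listener would be a matter of sending a message from the :data clause.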

A gen_server should be able to run under an OTP supervision tree as well, but our module has a major flaw: although it can correctly detect and report when rsync crashes or completes, when our gen_server is stopped by its supervisor it cannot stop its external child process in turn.

Problem: runaway processes

What this means is that rsync transfers would continue to run in the background even after Elixir had completely shut down, because the BEAM has no way of stopping the process.

To check whether this was something specific to rsync, I tried to open a Port spawning the command sleep 60 and I found that it behaves exactly the same way, hanging until the sleep ends naturally regardless of what happened in Elixir or whether its pipes are still open.
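The experiment can be reproduced with a sketch like the following. It assumes a Unix-like system with sleep and sh on the PATH, shortens the sleep to 5 seconds, and uses kill -0 (which only checks that a process exists) as an illustrative way to observe the result.

```elixir
sleep_path = System.find_executable("sleep")

port =
  Port.open({:spawn_executable, sleep_path}, [:binary, :exit_status, args: ["5"]])

# Remember the operating system pid before closing the port.
{:os_pid, os_pid} = Port.info(port, :os_pid)

# Closing the port closes the pipes, but sends no signal to the child.
true = Port.close(port)
Process.sleep(200)

# `kill -0` exits with status 0 when the process still exists.
{_output, kill_status} = System.shell("kill -0 #{os_pid}")
still_running? = kill_status == 0

IO.puts("sleep (OS pid #{os_pid}) still running after Port.close: #{still_running?}")
```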

Bad assumption: pipe-like processes

A program like gzip or cat will stop once it detects that its input has ended because the main loop usually makes a C system call to read like this:

ssize_t n_read = read(input_desc, buf, bufsize);
if (n_read < 0) { /* error... */ }
if (n_read == 0) { /* end of file... */ }

The manual for read[5] explains that reading 0 bytes indicates the end of file, and a negative number indicates an error such as the input file descriptor already being closed. If you think this sounds weird, I would agree: how do we tell the difference between a stream which is stalled and one which has ended? Does the calling process yield control until input arrives? How do we know if more than bufsize bytes are available? If that word salad excites you, read more about O_NONBLOCK[6] and unix pipes[7].

But here we'll focus on how processes affect each other through pipes. Surprising answer: they don't affect each other very much! Try opening a "cat" in the terminal and then type <control>-d to "send" an end-of-file. Oh no, you killed it! You didn't actually send anything, though: the <control>-d is interpreted by the terminal's line discipline, which responds by making the pending read on the child's "standard input" return zero bytes, and cat takes that as the end of its input. This is similar to how <control>-c is not sending a character but is interpreted by the terminal, which delivers an interrupt signal to the foreground process group, completely independently of the data stream. My entry point to learning more is this stty webzine[8] by Julia Evans. Go ahead and try this command, what could go wrong: stty -a

Any special behavior at the other end of a pipe is the result of intentional programming decisions and "end of file" (EOF) is more a convention than a hard reality. You could even reopen stdin from the application, to the great surprise of your friends and neighbors. For example, try opening "watch ls" or "sleep 60" and try <control>-d all you want: no effect. You did signal the end of its stdin but nobody cared, it wasn't listening to you anyway.

Back to the problem at hand, "rsync" is in this latter category of "daemon-like" programs which will carry on even after standard input is closed. This makes sense enough, since rsync isn't interactive and any output is just a side effect of its main purpose.

Shimming can kill

It's possible to write a small adapter which is sensitive to stdin closing, then converts this into a stronger signal like SIGTERM which it forwards to its own child. This is the idea behind a suggested shell script[9] for Elixir and the erlexec[10] library. The opposite adapter is also found in the nohup shell command and the grimsby[11] library: these will keep standard in and/or standard out open for the child process even after the parent exits.

I took the shim approach with my rsync library and included a small C program[12] which wraps rsync and makes it sensitive to the BEAM port_close. It's featherweight, leaving pipes unchanged as it passes control to rsync—its only real effect is to convert SIGHUP to SIGKILL (but should have been SIGTERM, see the sidebar discussion of different signals below).

Reliable clean up

It's always a pleasure to ask questions in the BEAM communities, which have earned their reputation for being friendly and open. The first big tip was to look at the third-party library erlexec, which demonstrates emerging best practices that could be backported into the language itself. Everyone speaking on the problem has generally agreed that the fragile clean up of external processes is a bug, and supported the idea that some flavor of "terminate" signal should be sent to spawned programs.

I would be lying if I hid my disappointment that the required core changes are mostly in a C program and not actually in Erlang, but it was still fascinating to open such an elegant black box and find the technological equivalent of a steam engine inside. All of the futuristic, high-level features we've come to know actually map closely to a few scraps of wizardry with ordinary pipes, using stdlib read, write, and select[13].

Port drivers[14] are fundamental to ERTS and external processes are launched through several levels of wiring: the spawn driver starts a forker driver which sends a control message to erl_child_setup to execute your external command. Each BEAM has a single erl_child_setup process to watch over all children.

Letting a child process outlive the one that spawned it leaves it in a state called an "orphaned process" in POSIX, and the standard recommends that when this happens the process should be adopted by the top-level system process "init" if it exists. This can be seen as undesirable because unix itself has a paradigm similar to OTP's Supervisors, in which each parent is responsible for its children. Without supervision, a process could potentially run forever or do naughty things. The system init process starts and tracks its own children, and can restart them in response to service commands. But init will know nothing about adopted orphan processes or how to monitor and restart them.

The patch PR#9453 adapting port_close to SIGTERM is waiting for review and responses look generally positive so far.



Future directions

Discussion threads also included some notable grumbling about the Port API in general; it seems this part of ERTS is overdue for a larger redesign.

There's a good opportunity to unify the different platform implementations: Windows lacks the erl_child_setup layer entirely, for example.

Another idea to borrow from the erlexec library is to have an option to kill the entire process group of a child, which is shared by any descendants that haven't explicitly broken out of its original group. This would be useful for managing deep trees of external processes launched by a forked command.

References