Elixir/HTML dump scraper

During each step of processing, we also write to a separate checkpoint file after every 100 articles.  The output file is written in chunks immediately after each checkpoint write, which narrows the window in which a crash can leave the two files inconsistent.  When resuming, if a checkpoint file is present, the job skips articles until it catches up with the count recorded there.  The overall behavior of the job is therefore "at least once" processing: the only irregularity that can occur is that some articles may be duplicated in the output.
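To make the mechanics concrete, here is a minimal sketch of that checkpoint/resume logic in Elixir.  The module, function, and path names are illustrative and not taken from the actual code.

<syntaxhighlight lang="elixir">
# Simplified sketch of the checkpoint/resume mechanics described above.
defmodule Scraper.Checkpoint do
  @interval 100

  # On startup, read the checkpoint (if any) to learn how many articles
  # the resumed job should skip before it starts producing output again.
  def resume_count(checkpoint_path) do
    case File.read(checkpoint_path) do
      {:ok, contents} -> contents |> String.trim() |> String.to_integer()
      {:error, :enoent} -> 0
    end
  end

  # Every @interval articles: record the running count, then immediately
  # flush the buffered output chunk, keeping the two writes close together
  # to shrink the window in which a crash leaves them out of sync.
  def flush(count, checkpoint_path, output_io, buffer) when rem(count, @interval) == 0 do
    File.write!(checkpoint_path, Integer.to_string(count))
    IO.binwrite(output_io, buffer)
    []
  end

  def flush(_count, _checkpoint_path, _output_io, buffer), do: buffer
end
</syntaxhighlight>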
== Concurrency ==
There are hundreds of separate wikis, so splitting the work by wiki and processing them concurrently is a natural first implementation.
When splitting by wiki, we ran into an interesting problem: the partitioning function was using <code>:erlang.phash2</code> to hash an object which contained the wiki ID, so we assumed it would give different results for each wiki.  As it turns out, <code>Flow.partition</code> needed an explicit hint to split the work correctly by wiki.
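A hedged sketch of what that hint can look like: <code>Flow.partition</code> accepts a <code>:key</code> option, and passing a function which extracts only the wiki ID routes every article from a given wiki to the same partition.  The module, field, and function names here are illustrative.

<syntaxhighlight lang="elixir">
# Illustrative only: each "article" is assumed to be a map with a :wiki_id
# field, and process_article/1 stands in for the real per-article work.
defmodule Scraper.Pipeline do
  def run(articles) do
    articles
    |> Flow.from_enumerable()
    # Partition on the wiki ID alone, so all articles from one wiki land in
    # the same partition and different wikis spread across the stages.
    |> Flow.partition(key: fn article -> article.wiki_id end, stages: 8)
    |> Flow.map(&process_article/1)
    |> Flow.run()
  end

  defp process_article(article), do: article
end
</syntaxhighlight>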
The next obvious fork point is the phase which makes external API requests, but this is trickier because we also want to limit total concurrency across all wikis, to avoid overwhelming the service.  This should be implemented with a connection pool, ideally one which reuses a small number of persistent connections as HTTP/1.1 allows.
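One possible shape for such a pool, assuming the Finch HTTP client (the project may well use a different one): a single named pool with a fixed size caps in-flight requests to the API host across all wikis, and the pooled HTTP/1.1 connections are reused between requests.  The pool name, URL, and sizes below are assumptions for illustration.

<syntaxhighlight lang="elixir">
# Start one shared pool under the application's supervision tree.
children = [
  {Finch,
   name: ScraperFinch,
   pools: %{
     # At most 10 persistent HTTP/1.1 connections to the API, shared by
     # every wiki's workers; requests wait for a free connection.
     "https://api.example.org" => [size: 10, count: 1]
   }}
]

{:ok, _sup} = Supervisor.start_link(children, strategy: :one_for_one)

# Each worker issues its requests through the shared pool:
{:ok, response} =
  Finch.build(:get, "https://api.example.org/page/html/Example")
  |> Finch.request(ScraperFinch)
</syntaxhighlight>

Because each request has to check a connection out of the pool before it is sent, the pool size effectively doubles as the global concurrency limit for API calls.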