Elixir/HTML dump scraper

[[File:1.Magukbaforduló ikerszelvényesek 72dpi.jpg|thumb|''Introverted Millipede'']]
{{Project|status=(beta)|url=https://gitlab.com/wmde/technical-wishes/scrape-wiki-html-dump}}

The processing is broken down into small units, each of which writes a single file. If a file already exists, we skip the corresponding calculation. This general caching technique is known as [[w:Memoization|memoization]].

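To illustrate the idea, here is a minimal Elixir sketch of such a unit; the module, function, and file names are hypothetical, not taken from the project's code:

<syntaxhighlight lang="elixir">
# Hypothetical sketch: each processing unit produces exactly one file, and
# an existing file is treated as an already-completed (memoized) result.
defmodule Scraper.Memo do
  def memoize(path, compute) do
    if File.exists?(path) do
      :cached
    else
      File.write!(path, compute.())
      :computed
    end
  end
end

# First call computes and writes the file; after a restart the expensive
# function is skipped because the file is already on disk.
Scraper.Memo.memoize("counts.txt", fn ->
  Enum.sum(1..1_000_000) |> Integer.to_string()
end)
</syntaxhighlight>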

During each unit of processing, the job also writes to a separate checkpoint file after every 100 articles. The output file is written in chunks immediately after each write to the checkpoint file, which shrinks the window of time in which a crash can leave the two files inconsistent. When resuming, if a checkpoint file is present, the job skips articles without processing them until its running count catches up with the count recorded in the checkpoint. The overall behavior of the job is therefore "at least once" processing: the only irregularity that can occur is duplicated work, so some articles may appear more than once in the output.

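Here is a minimal Elixir sketch of the checkpoint and resume logic, again with hypothetical names. Note one assumption: this sketch appends each output chunk ''before'' advancing the checkpoint, a simple ordering that guarantees the at-least-once property described above; the project's actual write order may differ.

<syntaxhighlight lang="elixir">
# Hypothetical sketch of checkpointed, at-least-once batch processing.
defmodule Scraper.Checkpoint do
  @interval 100

  # Streams `articles` through `process/1`, appending one line per article
  # to `out_path` in chunks of @interval, and recording the running article
  # count in `ckpt_path` after each chunk.
  def run(articles, out_path, ckpt_path, process) do
    done = read_count(ckpt_path)

    articles
    # Resume: skip articles already counted by the checkpoint.
    |> Stream.drop(done)
    |> Stream.chunk_every(@interval)
    |> Enum.reduce(done, fn chunk, count ->
      lines = Enum.map(chunk, process)
      # Appending the chunk before advancing the checkpoint means a crash
      # between the two writes can only duplicate work, never lose it.
      File.write!(out_path, Enum.join(lines, "\n") <> "\n", [:append])
      count = count + length(chunk)
      File.write!(ckpt_path, Integer.to_string(count))
      count
    end)
  end

  defp read_count(path) do
    case File.read(path) do
      {:ok, contents} -> contents |> String.trim() |> String.to_integer()
      {:error, :enoent} -> 0
    end
  end
end

# Example: process 10_000 fake "articles"; killing and restarting the run
# picks up from the last checkpoint.
Scraper.Checkpoint.run(1..10_000, "out.txt", "out.ckpt", &Integer.to_string/1)
</syntaxhighlight>

If the process dies between the chunk append and the checkpoint write, a restart re-processes that chunk from the article stream, so at worst the articles of one chunk appear twice in the output.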

== Concurrency ==