Elixir/HTML dump scraper
[[File:1.Magukbaforduló ikerszelvényesek 72dpi.jpg|thumb|''Introverted Millipede'']]
{{Project|status=(beta)|url=https://gitlab.com/wmde/technical-wishes/scrape-wiki-html-dump}}
The processing is broken down into small units which each write a single file. If a file already exists, then we skip the corresponding calculation. This general caching technique is known as [[w:Memoization|memoization]].
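The skip-if-the-file-exists check is simple to express in Elixir. Below is a minimal sketch of the idea; the <code>DumpScraper.Memo</code> module name, the file path, and the <code>parse_unit/1</code> helper are hypothetical illustrations, not the project's actual code.

<syntaxhighlight lang="elixir">
defmodule DumpScraper.Memo do
  # Returns the cached file contents if `path` already exists; otherwise
  # runs `compute`, writes the result to `path`, and returns it.
  def memoize(path, compute) when is_function(compute, 0) do
    if File.exists?(path) do
      File.read!(path)
    else
      result = compute.()
      File.write!(path, result)
      result
    end
  end
end

# Example: the expensive parse only runs when the output file is missing.
# DumpScraper.Memo.memoize("out/unit-0001.ndjson", fn -> parse_unit("unit-0001") end)
</syntaxhighlight>

Because the file itself is the cache key, deleting an output file and re-running the job recomputes exactly that unit and nothing else.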
While processing each unit, the job also writes to a checkpoint file at every multiple of 100 articles processed. The output file is written in chunks, each flushed immediately after the checkpoint write, which shrinks the window of time in which a crash can leave the output and checkpoint inconsistent. When resuming, if a checkpoint file is present, the job skips articles without processing them until the running count catches up with the count recorded in the checkpoint. The overall behavior of the job is therefore "at least once" processing, meaning that the only type of irregularity that can happen is to duplicate work for some articles.
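A minimal sketch of this checkpoint scheme in Elixir follows; the module name, function names, and the plain-integer checkpoint format are assumptions for illustration, not the project's actual implementation.

<syntaxhighlight lang="elixir">
defmodule DumpScraper.Checkpoint do
  @interval 100

  # At every multiple of 100 processed articles, record the running count
  # in the checkpoint file, then immediately flush the buffered output
  # chunk; returning the empty list resets the caller's buffer.
  def flush(out_io, checkpoint_path, count, chunk)
      when rem(count, @interval) == 0 do
    File.write!(checkpoint_path, Integer.to_string(count))
    IO.binwrite(out_io, chunk)
    []
  end

  # Between checkpoints, keep buffering.
  def flush(_out_io, _checkpoint_path, _count, chunk), do: chunk

  # On resume: the number of articles to skip before processing restarts.
  # A missing checkpoint file means a fresh run.
  def skip_count(checkpoint_path) do
    case File.read(checkpoint_path) do
      {:ok, text} -> text |> String.trim() |> String.to_integer()
      {:error, :enoent} -> 0
    end
  end
end
</syntaxhighlight>

On resume, the caller reads <code>skip_count/1</code> once and discards that many articles from the input stream before normal processing restarts.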
== Concurrency ==