Elixir/HTML dump scraper
The next obvious fork point would be in the phase which makes external API requests, but this is trickier because we also want to limit total concurrency across all wikis, to avoid overwhelming the service. This should be implemented with a connection pool, ideally one which reuses a small number of persistent connections as HTTP/1.1 keep-alive allows.
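To make that concrete, here is a rough sketch of what a shared pool might look like using the Finch HTTP client; the pool name, size, and host below are placeholders for illustration, not what the tool actually ships.

<syntaxhighlight lang="elixir">
# Hypothetical supervision-tree entry: one named Finch pool shared by
# every wiki being processed. `size: 10` caps the number of live
# connections, and therefore in-flight requests, to the external service.
children = [
  {Finch,
   name: ScrapeHTTP,
   pools: %{"https://maps.wikimedia.org" => [size: 10, count: 1]}}
]

Supervisor.start_link(children, strategy: :one_for_one)

# Requests from every wiki's pipeline funnel through the same pool, so
# HTTP/1.1 connections are kept alive and reused between requests;
# callers beyond the pool size simply queue for a free connection.
defmodule ScrapeHTTP.Client do
  def fetch(url) do
    Finch.build(:get, url)
    |> Finch.request(ScrapeHTTP)
  end
end
</syntaxhighlight>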
== Modularity ==
It will be no surprise that the analyses are run as separate units under a pluggable architecture, so that the tool can be reused for various tasks. The callbacks are crude for now and the abstraction is leaky, but it at least accomplishes code encapsulation and easily [https://gitlab.com/wmde/technical-wishes/scrape-wiki-html-dump/-/blob/main/config/config.exs configurable] composition.
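In outline, the contract amounts to an Elixir behaviour plus a list of modules in config.exs. The sketch below is a guess at the shape rather than the tool's real interface: the module and callback names (`Analysis`, `process_page/2`, and friends) are invented for illustration.

<syntaxhighlight lang="elixir">
# Hypothetical plug-in contract: each analysis implements these
# callbacks and is listed in config/config.exs, for example:
#
#   config :scrape_wiki_html_dump,
#     analyses: [MyAnalyses.TemplateCounter, MyAnalyses.RefScanner]
defmodule ScrapeWikiHtmlDump.Analysis do
  @callback init(opts :: keyword()) :: term()
  @callback process_page(page_html :: String.t(), state :: term()) :: term()
  @callback finalize(state :: term()) :: :ok
end

# The pipeline folds every page through each configured module,
# threading per-module state along the way.
defmodule ScrapeWikiHtmlDump.Runner do
  def run(pages) do
    analyses = Application.fetch_env!(:scrape_wiki_html_dump, :analyses)
    states = Map.new(analyses, fn mod -> {mod, mod.init([])} end)

    states =
      Enum.reduce(pages, states, fn page, acc ->
        Map.new(acc, fn {mod, st} -> {mod, mod.process_page(page, st)} end)
      end)

    Enum.each(states, fn {mod, st} -> mod.finalize(st) end)
  end
end
</syntaxhighlight>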
Modules must be written in Elixir, but we're also considering a language-agnostic callback if the need arises.
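If we do go that way, one plausible shape (purely speculative) is to stream pages to an external program over a Port as JSON lines, so a plug-in could be written in any language. This sketch assumes the Jason library for JSON encoding.

<syntaxhighlight lang="elixir">
defmodule ScrapeWikiHtmlDump.ExternalAnalysis do
  # Launch the external plug-in; it reads one JSON document per line on
  # stdin and writes one JSON result per line to stdout.
  def start(executable) do
    Port.open({:spawn_executable, executable}, [:binary, {:line, 65_536}])
  end

  # Send one page and wait for the plug-in's reply line.
  def process_page(port, page) do
    true = Port.command(port, [Jason.encode!(page), "\n"])

    receive do
      {^port, {:data, {:eol, result_json}}} -> Jason.decode!(result_json)
    end
  end
end
</syntaxhighlight>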
Some aspects of modularity were no fun, so we ignored them. For example, [https://gitlab.com/wmde/technical-wishes/scrape-wiki-html-dump/-/blob/main/metrics.md each metric is documented] in one big flat file. You'll also encounter some mild hardcoding, such as an entire extra processing phase to make external API requests for parsed map data.