I kept asking myself what would happen if I extended the feature presented in the chapter I was reading, only to find my answers in the next chapter.
Brilliant book.
* The Morning Paper blog's distributed systems tag [2] has a lot of good summaries of research on distributed systems, both from academia and industry.
* I maintain a list of assorted resources on distributed system design and operations on GitHub. [3]
* Also, as mentioned, Designing Data-Intensive Applications is a good starting place.
[1] https://github.com/donnemartin/system-design-primer
[2] https://blog.acolyer.org/tag/distributed-systems/
[3] https://github.com/DylanSp/distributed-systems-resources
I would also recommend reading VLDB papers and the database literature; they show how distributed algorithms are applied in practice: http://www.vldb.org/pvldb/vol9.html and http://www.redbook.io/
Disclaimer: I used to work at Couchbase (a distributed NoSQL database) as a PM and launched Eventing.
https://aws.amazon.com/builders-library/?cards-body.sort-by=...
> Bulk Synchronous Parallel: https://en.wikipedia.org/wiki/Bulk_synchronous_parallel .
Many/most (?) distributed systems can be described in terms of BSP primitives.
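To make the BSP idea concrete, here is a minimal sketch of the superstep pattern (communicate, wait at a barrier, compute locally) using Python threads as stand-in "nodes". The function name `bsp_ring_sum`, the ring topology, and the double-buffered inboxes are all assumptions chosen for illustration; a real BSP framework would handle message delivery itself.

```python
import threading


def bsp_ring_sum(values):
    """Every worker ends up with the sum of all values after n-1 supersteps.

    Hypothetical illustration: workers sit on a ring and forward one value
    to their right-hand neighbour in each superstep.
    """
    n = len(values)
    if n == 0:
        return []

    barrier = threading.Barrier(n)
    # Two inbox buffers, alternated by superstep parity, so that a fast
    # worker cannot overwrite a message its neighbour has not read yet.
    inboxes = [[None] * n, [None] * n]
    totals = list(values)  # running sum held by each worker
    carry = list(values)   # value each worker forwards next

    def worker(rank):
        for step in range(n - 1):
            inbox = inboxes[step % 2]
            # Communication phase: send to the right-hand neighbour.
            inbox[(rank + 1) % n] = carry[rank]
            # Barrier: nobody proceeds until all messages are delivered.
            barrier.wait()
            # Local computation phase: consume the received message.
            carry[rank] = inbox[rank]
            totals[rank] += carry[rank]

    threads = [threading.Thread(target=worker, args=(r,)) for r in range(n)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return totals


if __name__ == "__main__":
    print(bsp_ring_sum([1, 2, 3, 4]))
```

The barrier is what makes this "bulk synchronous": no worker reads its inbox until every worker has finished sending for that superstep.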
> Paxos: https://en.wikipedia.org/wiki/Paxos_(computer_science) .
> Raft: https://en.wikipedia.org/wiki/Raft_(computer_science)#Safety
> CAP theorem: https://en.wikipedia.org/wiki/CAP_theorem .
Papers-we-love > Distributed Systems: https://github.com/papers-we-love/papers-we-love/tree/master...
awesome-distributed-systems also has many links to theory: https://github.com/theanalyst/awesome-distributed-systems
- Byzantine fault: https://en.wikipedia.org/wiki/Byzantine_fault :
> A [Byzantine fault] is a condition of a computer system, particularly distributed computing systems, where components may fail and there is imperfect information on whether a component has failed. The term takes its name from an allegory, the "Byzantine Generals Problem", developed to describe a situation in which, in order to avoid catastrophic failure of the system, the system's actors must agree on a concerted strategy, but some of these actors are unreliable.
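As a rough illustration of the quoted definition, here is a toy sketch of the one-traitor "oral messages" scheme from the Byzantine Generals Problem, with one commander and three lieutenants. The names `om1`, `flip`, and `split`, and the tie-break default of "retreat", are assumptions made for this example; it is a single-process simulation, not a networked protocol.

```python
from collections import Counter


def majority(votes, default="retreat"):
    """Most common vote; falls back to `default` on a tie."""
    ranked = Counter(votes).most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return default
    return ranked[0][0]


def om1(order, n_lieutenants, traitors, lie):
    """Simulate one round of relaying with at most one traitor.

    Node 0 is the commander; lieutenants are numbered 1..n_lieutenants.
    `lie(sender, receiver, value)` is what a traitorous sender transmits.
    Returns the decision of each loyal lieutenant.
    """
    lieutenants = range(1, n_lieutenants + 1)

    # Round 1: the commander sends an order to every lieutenant.
    received = {
        i: lie(0, i, order) if 0 in traitors else order
        for i in lieutenants
    }

    # Round 2: every lieutenant relays what it received to all the others,
    # then each loyal lieutenant takes the majority of what it has heard.
    decisions = {}
    for i in lieutenants:
        if i in traitors:
            continue
        votes = [received[i]]
        for j in lieutenants:
            if j == i:
                continue
            relayed = received[j]
            votes.append(lie(j, i, relayed) if j in traitors else relayed)
        decisions[i] = majority(votes)
    return decisions


def flip(sender, receiver, value):
    """A traitor that always reports the opposite order."""
    return "retreat" if value == "attack" else "attack"


def split(sender, receiver, value):
    """A traitor that tells odd-numbered receivers one thing, even another."""
    return "attack" if receiver % 2 else "retreat"


if __name__ == "__main__":
    print(om1("attack", 3, traitors={3}, lie=flip))
    print(om1("attack", 3, traitors={0}, lie=split))
```

With four nodes and one traitor, the loyal lieutenants reach the same decision in both runs, whether the traitor is a lieutenant or the commander.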
awesome-bigdata lists a number of tools: https://github.com/onurakpolat/awesome-bigdata
Practically, dask.distributed (joblib -> SLURM), dask-ml, dask-labextension (a JupyterLab extension for dask), and the Rapids.ai tools (e.g. cuDF) scale from one to many nodes.