HACKER Q&A
📣 shahrk

Learning about distributed systems?


I used to love Operating Systems during my undergrad; Modern Operating Systems by Tanenbaum is to date the only academic book I've read in its entirety. I recently read an article by Werner Vogels about how Amazon built Aurora, and I was captivated by it. I want to start reading about distributed systems. What would be a good starting point or road map?


  👤 otras Accepted Answer ✓
I posted this recently, but MIT's 6.824: Distributed Systems (taught by Robert Morris, of both Morris worm and Viaweb/Y Combinator fame) is completely open and available online. It includes video lectures, notes, readings, and programming assignments from as recently as Spring 2020 (including half of the lectures recorded from home as the pandemic struck). The assignments even include auto-graded testing scripts, so you can verify your solutions.

https://pdos.csail.mit.edu/6.824/


👤 weitzj
The book “Designing Data-Intensive Applications” by Martin Kleppmann is a fantastic read with a concise train of thought. It builds up from the basics, adding one thing after another.

I kept asking myself what would happen if I were to extend the feature presented in the chapter I was reading, only to find my answers in the next chapter.

Brilliant book


👤 DylanSp
* This System Design Primer [1] on GitHub is a decent overview of how large-scale apps are designed, with jumping-off points into many different subjects.

* The Morning Paper blog's distributed systems tag [2] has a lot of good summaries of research on distributed systems, both from academia and industry.

* I maintain a list of assorted resources on distributed system design and operations on GitHub. [3]

* Also, as mentioned, Designing Data-Intensive Applications is a good starting place.

[1] https://github.com/donnemartin/system-design-primer

[2] https://blog.acolyer.org/tag/distributed-systems/

[3] https://github.com/DylanSp/distributed-systems-resources


👤 venkasub
Stumbled on this, which I felt was a very good compendium: https://dancres.github.io/Pages/

I would also recommend reading VLDB papers and database literature, which show how distributed algorithms are applied: http://www.vldb.org/pvldb/vol9.html and http://www.redbook.io/

Disclaimer: I used to work at Couchbase (a distributed NoSQL database) as a PM and launched Eventing.


👤 k00b
Speaking of Werner Vogels, have you seen this blog post on DistSys reading? https://www.allthingsdistributed.com/2012/12/paper-readings-...

👤 dastbe
While I don't see it as a starting point (I think the topics require more context), I'm a big fan of the articles Amazon has published recently as the "Builders' Library".

https://aws.amazon.com/builders-library/?cards-body.sort-by=...


👤 melvinroest
Maarten van Steen has got you covered. He worked with Andrew Tanenbaum on all kinds of things back in the day :)

https://www.distributed-systems.net/index.php/books/ds3/


👤 AdamM12
I found this to be a really good resource, with high-level patterns to use: https://www.oreilly.com/library/view/designing-distributed-s...

👤 westurner
From a previous question re: "Ask HN: CS papers for software architecture and design?" (https://news.ycombinator.com/item?id=15778396) and the distributed systems we eventually realize were needed in the first place:

> Bulk Synchronous Parallel: https://en.wikipedia.org/wiki/Bulk_synchronous_parallel .

Many/most (?) distributed systems can be described in terms of BSP primitives.
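
To make the BSP model a bit more concrete, here is a minimal Python sketch of its three primitives: local computation, message exchange, and a barrier that ends the superstep. The name `bsp_sum` and the shared-list "inbox" are illustrative assumptions, and threads in one process stand in for separate machines, so treat it as a toy rather than a real BSP runtime.

```python
import threading


def bsp_sum(chunks):
    """Toy BSP run: every worker ends up knowing the global sum."""
    n = len(chunks)
    barrier = threading.Barrier(n)       # superstep boundary shared by all workers
    inbox = [[] for _ in range(n)]       # inbox[i] holds messages delivered to worker i
    lock = threading.Lock()              # guards the simulated network
    results = [None] * n

    def worker(rank):
        # Superstep 1, phase 1: local computation on this worker's own data.
        partial = sum(chunks[rank])

        # Superstep 1, phase 2: communication (send the partial sum to everyone).
        with lock:
            for dest in range(n):
                inbox[dest].append(partial)

        # Superstep 1, phase 3: barrier. No worker proceeds until all have sent.
        barrier.wait()

        # Superstep 2: every message from superstep 1 is now visible.
        results[rank] = sum(inbox[rank])

    threads = [threading.Thread(target=worker, args=(r,)) for r in range(n)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


if __name__ == "__main__":
    print(bsp_sum([[1, 2], [3, 4], [5]]))
```

The barrier is what makes this BSP rather than plain message passing: messages sent in one superstep are only read in the next, so no worker ever sees a half-delivered round.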

> Paxos: https://en.wikipedia.org/wiki/Paxos_(computer_science) .

> Raft: https://en.wikipedia.org/wiki/Raft_(computer_science)#Safety

> CAP theorem: https://en.wikipedia.org/wiki/CAP_theorem .

Papers-we-love > Distributed Systems: https://github.com/papers-we-love/papers-we-love/tree/master...

awesome-distributed-systems also has many links to theory: https://github.com/theanalyst/awesome-distributed-systems

- Byzantine fault: https://en.wikipedia.org/wiki/Byzantine_fault :

> A [Byzantine fault] is a condition of a computer system, particularly distributed computing systems, where components may fail and there is imperfect information on whether a component has failed. The term takes its name from an allegory, the "Byzantine Generals Problem",[2] developed to describe a situation in which, in order to avoid catastrophic failure of the system, the system's actors must agree on a concerted strategy, but some of these actors are unreliable.
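
As a hedged illustration of the quoted problem, the sketch below simulates one commander and three lieutenants with at most one traitor, using the one-round "oral messages" scheme from the Byzantine Generals paper (often written OM(1)). That algorithm is not named in the quote; it is swapped in here as the simplest known solution for four actors. The function names and the deterministic lying behaviour of the traitor are assumptions made for the example.

```python
from collections import Counter


def flip(order):
    """Return the opposite order; this is how the simulated traitor lies."""
    return "retreat" if order == "attack" else "attack"


def majority(votes, default="retreat"):
    """Most common vote, falling back to an arbitrary default on a tie."""
    ranked = Counter(votes).most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return default
    return ranked[0][0]


def om1(order, n_lieutenants=3, traitor=None):
    """Simulate OM(1). Node 0 is the commander, nodes 1..n are lieutenants.

    `traitor` is the id of the single faulty node, or None if all are loyal.
    Returns the decision of every loyal lieutenant.
    """
    lieutenants = range(1, n_lieutenants + 1)

    # Round 1: the commander sends an order to each lieutenant.
    # A traitorous commander sends conflicting orders (assumed behaviour).
    received = {}
    for i in lieutenants:
        if traitor == 0:
            received[i] = order if i % 2 else flip(order)
        else:
            received[i] = order

    # Round 2: each lieutenant relays what it received to the others,
    # then decides by majority over its own value and the relayed ones.
    decisions = {}
    for i in lieutenants:
        if i == traitor:
            continue  # the traitor's own decision is irrelevant
        votes = [received[i]]
        for j in lieutenants:
            if j == i:
                continue
            # A traitorous lieutenant relays the opposite of what it was told.
            votes.append(flip(received[j]) if j == traitor else received[j])
        decisions[i] = majority(votes)
    return decisions


if __name__ == "__main__":
    print(om1("attack"))             # all loyal
    print(om1("attack", traitor=3))  # one lying lieutenant
    print(om1("attack", traitor=0))  # lying commander
```

With four actors and one traitor, the loyal lieutenants reach the same decision in all three runs. With only three actors (`n_lieutenants=2`), the majority vote can tie, which reflects the known requirement of at least 3f + 1 actors to tolerate f traitors when messages are unsigned.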

awesome-bigdata lists a number of tools: https://github.com/onurakpolat/awesome-bigdata

Practically, dask.distributed (joblib -> SLURM), dask ML, dask-labextension (a JupyterLab extension for dask), and the Rapids.ai tools (e.g. cuDF) scale from one to many nodes.