Failure and Consensus Introduction

Consensus is among the most fundamental problems in distributed systems.

As an example, consider Dropbox and other file sharing services. Their goal is for the files on my laptop, their cloud server, and any other machines I own to be the same. That is, they want consensus on what files are there and what contents they have.

There are various ways to define consensus and lots of consensus algorithms, some tuned to specific problems. Consensus is easy to achieve when everything works perfectly. Consequently, good algorithms focus on consensus in the presence of failure.

In this lesson, we’ll first consider transactions which allow closely-coordinated processes to achieve consensus on values about failure itself and maintain a consistent state.

Transactions are too expensive for most distributed systems where recognizing failure and handling it correctly requires different techniques.

Beyond outright failure, synchronizing values is an interesting problem. We’ll look at Paxos as an example of a consensus algorithm.

Synchronization when some actors are malicious is an important problem known as the Byzantine Generals Problem. Bitcoin is an example of how Byzantine faults can be overcome.

Lesson Objectives

After completing this lesson, you should be able to

  1. Describe the relationship between failure and consistency.
  2. Understand ACID properties and the levels of isolation in transactions.
  3. Explain the limitations of classic transactions in distributed systems and the options for handling failure.
  4. Describe the Paxos algorithm, understand how it achieves consensus, and how it might fail.
  5. Describe Byzantine failure. Understand how the blockchain is designed to overcome byzantine failure.

Required Reading/Viewing

Additional Resources