
A brief history of erasure coding

Erasure coding is a key concept in data storage, but its origins stretch way back, sharing a common ancestry with cryptography. Buckle in to find out more, and yes, there’s a test at the end…


Anyone concerned with maintaining digital backups of their work will be familiar with the basic concept of using replication to increase data durability, even if they wouldn’t describe it in those words. It maps easily to maintaining copies of physical artifacts in the real world.


When the Library of Alexandria was destroyed, a large number of unique scientific texts were lost forever. This was changed by the printing press, which allowed complete, identical (or near-enough identical) copies of books to be made and geographically distributed, making it much less likely for a work to be lost to time, even before the age of the internet.

One of history’s earliest replication technologies: the printing press

Of course, no one library can act as a full back-up of every other physical library – there just isn’t the space. The same is true for data in any kind of storage system – and it only gets worse if you also want to retain historical copies of how data changes over time. To keep disk drives from overflowing, organisations have long turned to tape backup systems for maintaining large volumes of ‘inactive’ or ‘cold’ data, which may not need to be accessed again for months or years. Tape backups have their own issues, however, both logistical and practical.


Erasure coding takes a completely different approach

Rather than maintaining full duplicates of a unit of data, erasure coding breaks the data down into a set of fragments, which can be used to reconstruct the original in the event of data loss. This may sound similar to RAID striping at first, but erasure coding side-steps many of the problems RAID implementations face.


To fully understand erasure coding, you first need to understand its legacy: error detection. 


Error detection

Long before the invention of modern computers, mathematicians wrangled with techniques to ensure transmitted data could be verified on receipt. That is, ‘can we tell if this message reached us intact and unaltered?’

It’s the difference between sending a message to your travelling parents that says ‘the house is fine’…

Picture of some houses, and a tablet with a video phone call from a man and woman, context implies they are your parents.

…or ‘the house is fire’.

 

House is now on fire, and parents look horrified.

Why not just prevent errors in the first place?

It’s a nice idea, but creating an ideal transmission environment outside of a laboratory is unrealistic (even within a laboratory setting, it won’t be perfect). When it comes to digital systems, errors can arise from basic human mistakes, network glitches, noise on a data bus being interpreted as real data, or even something like a cosmic ray briefly interfering with a microchip.


So if we know there will always be problems passing data through a physical network, how exactly do we know when we have received a message that wasn’t quite right? 


Parity check, please!

The foundation of error detection in computing is all about parity checks. You might be familiar with a ‘checksum’ – a calculation you can perform on the data within an object or file which gives you a practically unique ‘fingerprint’ (in this case, a string of letters and numbers). If you’ve ever downloaded an executable file to two identical machines, and it ran perfectly on one and not the other, comparing the checksum of each instance of the executable file may identify the problem – one copy is slightly off.


Parity checks are an early, basic form of checksums. And they’re so simple, you can do them by hand! (Or in your text editor of choice – I won’t judge you. Though there’s something pretty satisfying about doing these by hand.)


Consider we want to send a secret message of, conveniently, seven bits: 


Pssssst… pass it on… 0010110


Scandalous! We absolutely can’t keep this message to ourselves. But what if we want to pass on this message to someone else, with a degree of certainty that they’ll know if it’s correct? We can simply add one more bit. Ready for a bit of spot-the-difference? 


0010110 becomes… 10010110


Can you see where we added the extra bit?


It’s been added to the front of the string of zeros and ones. In this case, we added a ‘1’ for our parity bit. But why?


The rule is simple. If our message contains an even number of ‘1’s, your parity bit is ‘0’. If our message contains an odd number of ‘1’s (ours had three) then your parity bit is ‘1’. That means… 


A parity bit will always be set so as to make the number of ‘1’s in your message even.


So if your local binary gossip buddies are expecting a 7-bit message from you, but they get one like ‘00010110’… they’ll know that something isn’t right. In this scenario, they can just request the message (our 8-bit data string) again, until the bits add up correctly. 
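If you’d rather let Python do the counting, here’s a minimal sketch of the same rule (the function names are my own invention, not part of any standard library):

```python
def add_parity_bit(message: str) -> str:
    """Prefix a binary message with an even-parity bit."""
    parity = str(message.count("1") % 2)  # '1' if the message has an odd number of 1s
    return parity + message

def looks_intact(received: str) -> bool:
    """True if the total number of 1s (parity bit included) is even."""
    return received.count("1") % 2 == 0

print(add_parity_bit("0010110"))  # -> 10010110
print(looks_intact("10010110"))   # -> True
print(looks_intact("00010110"))   # -> False: time to request a re-send
```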


Try it yourself

Parity bit calculation: try it out!

Just like a system resending a message until it gets it right, if you want to solidify the concept of parity bits, why not have a go at adding a parity bit to the beginning of each of these 7-bit strings? I’ll share the answers at the end of this blog post.


  1. 1110110

  2. 1111011

  3. 0001100


Of course, you’ve probably already seen where this system can fall down.


What if there is more than one error, like two ones in our message becoming zeroes, e.g. 10010110 becoming 10000100? A parity check won’t pick this up.


Or, what happens if your parity bit flips, changing 10010110 to 00010110? The actual content of the message is fine, but it will be recognised as an error.


From this we can see that a parity check is certainly better than nothing, but it can only be relied upon to confirm that there is at least one error in your message. But it will never provide absolute certainty that a message is correct.


Still – it’s a start. Besides, sometimes a basic check like this is all you need. It’s easy to calculate, and doesn’t add a ton of bulk to your data. Parity checks are still used extensively to this day, particularly in hardware applications – lots of microprocessor caches use them, as do both the PCI and SCSI bus standards, along with RS-232 serial.


Pssst… one more message… parity bits don’t need to be binary

Binary messages make it easy to quickly calculate parity bits, but it’s not the only place you’ll find them. There’s a decimal equivalent of the parity bit, and you probably have at least one in your wallet right now.


Here’s one of my co-worker's old, expired credit cards (used with permission) from a bank account which no longer exists:


Example card number is 4659 4303 8751 4891

The last digit of the credit card number, in this case ‘1’, is a check digit. Though don’t get hung up on the fact that it’s a ‘1’ – here, a check digit could be any number from zero to nine. Other identifiers, like ID card numbers or social security numbers, often use check digits in the same way.


This is how a web form can validate a credit card number before actually charging it.


So if a binary parity check looks for the number of ‘1’s in a message, that is, checking for one of two possible states… then how do you think a decimal check digit is calculated?


If you guessed that it has something to do with checking for values of ten… congratulations! You’re already halfway there.


Here’s an example calculation, using the old credit card number: 



This is a practical application of the Luhn algorithm, also known as the ‘mod 10’ algorithm, created by Hans Peter Luhn, a German-American computer scientist who developed the formula during his time working for IBM in the 1940s and 1950s. He’s also the guy who came up with the idea of using information buckets to speed up retrieval of data in storage!


Looking at the table above, you can probably already see what’s going on.


First, we double every even-positioned digit (the first digit of the card number is in position ‘0’, the second in position ‘1’, the third in position ‘2’, etc.).


Second, if any of the doubled digits are now greater than a value of 9, we add the resulting two numbers together, e.g. ‘10’ becomes 1 + 0, so ‘1’, and ‘16’ becomes 1 + 6, so ‘7’.


Third, we add all of the unchanged numbers (those in positions 1, 3, 5 and so on) with the new numbers calculated in the first and second steps. In this example, the total is 79.


Finally, we choose our check digit. It will be the smallest number needed to make the total sum divisible by ten – in this case, ‘1’, as 79 + 1 = 80.  
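If you’d rather see those four steps as code, here’s a rough Python version, assuming we’re given the card number without its final check digit:

```python
def luhn_check_digit(partial_number: str) -> int:
    """Compute the check digit for a card number missing its final digit."""
    digits = [int(c) for c in partial_number.replace(" ", "")]
    total = 0
    for position, digit in enumerate(digits):
        if position % 2 == 0:   # step 1: double every even-positioned digit
            digit *= 2
            if digit > 9:       # step 2: e.g. 16 becomes 1 + 6 = 7
                digit -= 9
        total += digit          # step 3: sum everything
    return (10 - total % 10) % 10  # step 4: smallest digit making the total divisible by 10

print(luhn_check_digit("4659 4303 8751 489"))  # -> 1, matching the card above
```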


Try it yourself

Credit card check digit calculation: try it out!

If you want to try one yourself, let’s use the luckiest person in the world’s credit card number… or is it? Will the credit card number 7777 7777 7777 777X have a final digit of 7, too?




Speaking of luck – remember how our binary parity bits still needed a little of it? They could only ever tell us with certainty that “there is at least one error”. It’s time for something that can give us a little more certainty with our error checking. 


A logical conclusion…

Have you ever seen a puzzle along the lines of this one?


Four start-ups, each with uniquely-coloured logos, are revolutionizing a form of transport:


  • one is focusing on cars,

  • one on bicycles,

  • one on horses, and

  • one is determined to revive and tame dinosaurs.


Using the clues provided, which start-up is focusing on which transport idea, and what colour is their logo?


This is a logic puzzle, solvable using an elimination grid, like the one below:



The aim of these puzzles is to provide the reader with just enough information to fill out the grid, identifying the link between each element; in this case, the elements we must link are start-ups, transportation method, and logo colour. To do this, a reader will use the rows and columns in the elimination grid to, you guessed it, eliminate incorrect answers, until the grid is entirely filled.


For example, using the first clue below (‘S Power-21 refuses to work with animals on principle, and chose a green or red logo’), we can:


  1. Place a dash (-) where S Power-21’s row intersects with the dinosaur and horse columns, indicating that we know that company does not use either of these in their transportation idea.

  2. Place a dash where this row intersects with the ‘blue’ and ‘yellow’ columns, as we know that S Power-21’s only options for a logo are red and green.


So now our grid looks like this: 



 By continuing on like this for each clue, we will eventually narrow down the options to one colour and one transport choice for each company. You can place an ‘X’ in any row or column where you have eliminated every other possible answer, confirming those two items are linked.  


Try it yourself

These are the only clues you need.

  • S Power-21 refuses to work with animals on principle, and chose a green or red logo.

  • Despite their name, Aaaaa! is not trying to revive the dinosaurs.

  • Yuno Dynamics has either a blue or yellow logo. Meanwhile, Divide by Her0 uses neither of those colours for their own logo.

  • Both S Power-21 and Aaaaa! considered working with bicycles, but only one decided to follow through on it.

  • The green logo doesn’t belong to the company working with cars.

  • The company that is ‘revolutionising horses’ has a name beginning with a vowel.

  • The company with a red logo is not trying to revive the dinosaurs, but the company with the blue logo is.


Ready to solve the puzzle? As always, the answer will appear at the end of this post!
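(And if you’d rather let a computer do the grunt work, here’s a brute-force checker in Python. It simply tries every possible assignment and keeps the one consistent with all the clues – a much blunter instrument than the elimination grid, but handy for verifying your answer. Run it yourself if you want to be spoiled early!)

```python
from itertools import permutations

companies  = ["S Power-21", "Aaaaa!", "Yuno Dynamics", "Divide by Her0"]
transports = ["cars", "bicycles", "horses", "dinosaurs"]
colours    = ["green", "red", "blue", "yellow"]

def satisfies_clues(t, c):
    """t maps company -> transport; c maps company -> logo colour."""
    owner = {colour: name for name, colour in c.items()}
    return (t["S Power-21"] not in ("horses", "dinosaurs")    # clue 1
            and c["S Power-21"] in ("green", "red")
            and t["Aaaaa!"] != "dinosaurs"                    # clue 2
            and c["Yuno Dynamics"] in ("blue", "yellow")      # clue 3
            and c["Divide by Her0"] not in ("blue", "yellow")
            and "bicycles" in (t["S Power-21"], t["Aaaaa!"])  # clue 4
            and t[owner["green"]] != "cars"                   # clue 5
            and t["Aaaaa!"] == "horses"                       # clue 6: the only vowel-initial name
            and t[owner["red"]] != "dinosaurs"                # clue 7
            and t[owner["blue"]] == "dinosaurs")

for ts in permutations(transports):
    for cs in permutations(colours):
        t, c = dict(zip(companies, ts)), dict(zip(companies, cs))
        if satisfies_clues(t, c):
            for name in companies:
                print(f"{name}: {t[name]}, {c[name]} logo")
```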


Logic puzzles using elimination grids like these were first produced by mathematician Charles Lutwidge Dodgson, better known by his pen name, Lewis Carroll, author of Alice in Wonderland. But what connection do they have with the history of error detection? This will become clearer in our next section.


Using Hamming codes for error detection and correction

Or, the man who needed to get things right


Richard Hamming was a mathematician and professor who went on to work at Los Alamos on the Manhattan Project in 1945. There, he was part of a team responsible for programming the IBM calculating machines used by the Manhattan Project scientists to check and process their formulas. All of IBM’s machines at the time used punched cards to receive and represent data, similar to the one below: 


A used punched card, photographed by Pete Birkinshaw.


Given the sombre significance of the project he worked on, Hamming took his role in checking the accuracy of processed punched card data very seriously, knowing that work like this needed precision at every part of the process. He was what he described at the time as a ‘computer janitor’, but this desire to minimise errors stuck with him as he went on to join Bell Labs in 1946, where he eventually became known as one of the Young Turks.


The ‘Young Turks’ of Bell Labs

The Young Turks were a group of Bell Labs researchers who were offered much more freedom to work on their personal projects than was typical at scientific institutions of the time. This was thanks to the hugely valuable results they generated early on, which convinced others at Bell Labs to give them room to explore their specialities further. John Tukey, one of the group, coined the term “bit” as a contraction of “binary digit”.


As Hamming put it:

“We were first-class troublemakers. We did unconventional things in unconventional ways and still got valuable results. Thus management had to tolerate us and let us alone a lot of the time.”

With this freedom and support to make trouble and experiment, Hamming was able to use punched-card-reading calculating machines on a regular basis, but his frustration continued to grow at the number of errors they produced thanks to an inaccurately punched hole or a minor alignment issue (frequently, a slightly bent card). One Friday, Hamming set his calculating machines to work on a set of long and complicated problems over the weekend, then went home. When he returned on Monday, he discovered that due to a small error the entire calculation had died early on Saturday morning, meaning he had to start the whole thing from scratch.


Hamming was well and truly fed up with this. He paused everything else he was working on so that he could discover a way to make calculating machines detect errors themselves – and then correct them, rather than simply flagging inconsistent data for re-transmission.


The logic puzzle connection

 So, we know about parity bits (and their weaknesses) and we know about elimination grids. Now it’s time to combine the two. Here is a message containing sixteen bits: 



Hamming realised that we don’t need to check every bit in a message for errors, only certain sub-sets. But by combining the results of the sub-set checks, we not only have the ability to check for errors – we can also correct them. The trick is that this is really an 11-bit message, with 5 bits operating as parity checks. This might seem like a lot, but the method actually scales quite effectively: a 256-bit grid would only need nine parity bits (eight positional checks plus the overall parity bit) to use this method successfully.


As with our earlier parity checks, each position on the grid is numbered from zero. So the top-left grid position is at location (0,0) and the bottom-right position is (3,3).


Let’s check this out…

 First, let’s parity check the odd columns, 1 and 3, using the first number in column 1 as the parity bit (see underlined). Remember, our column numbering starts at ‘0’, so this is the second column. 



There are an even number of ‘1’s, so right now, this grid seems fine. Time for our next check.


This time, we look at the count of ‘1’s in the right half of the grid, that is, columns 2 and 3. This time, our parity bit is the first digit of column 2, underlined.



Aha! An error – there are five ‘1’s, an odd number. Combined with the clean result for columns 1 and 3, we can conclude that there is definitely an error in column 2.


Now we repeat this same process, but with the rows: first rows 1 and 3, then rows 2 and 3, with the parity bits sitting in column 0 of rows 1 and 2 respectively.



 Okay, no error detected in row 1 or 3. Time to check rows 2 and 3! 



There it is. An error. Logically, there must be an error in row 2.


This means we know that the bit in position (2,2) should be flipped, in this case, from a ‘1’ to a ‘0’.


But is there only one error? The trick is in our final bit at (0,0). Use it as a final parity check for the entire grid: 



Eight ‘1’s! An even number. We can now be reasonably satisfied that we have the correct data.


Okay, so we can now detect and correct single-bit errors – and using the parity bit at (0,0) we can detect errors of two or more bits – though we can’t correct those. Data with errors of two or more bits will still need to be re-sent, but this is a huge step in the right direction.
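Here’s a sketch of those grid checks in Python. The grid below is a made-up but internally consistent example (not the one pictured above), with positions numbered row by row:

```python
def parity_fails(bits, positions):
    """Even-parity check over a subset of grid positions: True means 'check failed'."""
    return sum(bits[p] for p in positions) % 2 == 1

def locate_single_error(bits):
    """bits: 16 ints (0/1), the 4x4 grid read row by row (position = row * 4 + column)."""
    col  = 2 * parity_fails(bits, [p for p in range(16) if p % 4 >= 2])        # right half
    col +=     parity_fails(bits, [p for p in range(16) if (p % 4) % 2 == 1])  # odd columns
    row  = 2 * parity_fails(bits, [p for p in range(16) if p // 4 >= 2])       # bottom half
    row +=     parity_fails(bits, [p for p in range(16) if (p // 4) % 2 == 1]) # odd rows
    return row, col  # (0, 0) means the four checks found nothing to fix

grid = [1, 0, 0, 1,
        1, 0, 0, 1,
        0, 1, 0, 1,
        1, 0, 1, 0]       # a made-up grid whose parity bits are all consistent
grid[2 * 4 + 2] ^= 1      # sabotage: flip the bit at row 2, column 2
print(locate_single_error(grid))  # -> (2, 2): flip that bit back to repair the grid
print(sum(grid) % 2 == 0)         # -> False: overall parity is odd, consistent with one flip
```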


It also might remind some of you of a particular approach to data storage redundancy… 


Smells like RAID in here

The way we check columns and rows with Hamming Codes may remind you of RAID’s ‘striping’ technique. RAID in general will use ‘mirroring’ or ‘striping’ to preserve data. ‘Mirroring’ is the creation of disk drive duplicates – RAID1 uses mirroring alone to preserve data. The idea is similar to what we discussed at the beginning of this post regarding keeping duplicates of books, though in this case, the books are all in the same library. Still, if one copy of the book is destroyed, the remaining copies in the library can still be used.


‘Striping’, on the other hand, doesn’t clone whole drives; instead, it breaks data up across multiple drives. The RAID approach that will sound the most familiar in terms of Hamming Codes is RAID 5, which requires a minimum of three disk drives, and offers striping with parity. Data is broken up into pieces, which are distributed across multiple hard disks, with parity blocks added to protect its durability (in a three-drive array, one parity block for every two data blocks). Much like Hamming Codes, if a drive fails, the missing data can be reconstructed using the data on the other disks. However, just like Hamming Codes, RAID 5 can support only one disk failure at a time.
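To make the parity idea concrete, here’s a toy sketch of RAID 5-style recovery in Python – just the XOR arithmetic, not anything resembling a real RAID implementation:

```python
def xor_blocks(a: bytes, b: bytes) -> bytes:
    """Byte-wise XOR of two equal-length blocks."""
    return bytes(x ^ y for x, y in zip(a, b))

block_1 = b"hello wo"                   # data block on drive 1
block_2 = b"rld!!!!!"                   # data block on drive 2
parity  = xor_blocks(block_1, block_2)  # parity block on drive 3

# Drive 2 dies. XOR the survivors to rebuild its contents:
rebuilt = xor_blocks(block_1, parity)
print(rebuilt)                          # -> b'rld!!!!!'
```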

We’ve been talking about sending and receiving messages, but these concepts all apply to storage too. When it comes to storing data, the sender of the ‘message’, our data, is the past – and the recipient is the future. RAID 5’s parity-based approach, a descendant of these parity-check ideas, is an attempt to preserve data for the future, correcting any errors that may occur while it is in storage.


Mirroring and striping certainly assist with maintaining data durability over time, but only being able to survive the loss of a single drive in a three-drive array is still not that great. But where do we go from here?


The Reed-Solomon code

Ten years after Hamming published his paper on Hamming Codes, Irving Reed and Gustave Solomon took a look at this problem from another angle, publishing their own paper, Polynomial Codes over Certain Finite Fields. This paper describes the Reed-Solomon codes, a group of error correction codes used in many data storage technologies (for example, RAID 6 and DVDs) along with many data transmission technologies.


This is where we start to touch on some deeper, and more complex mathematics. We’ll just look at the two essential ideas underpinning their approach, Galois fields, and polynomial interpolation. 


Galois fields

Évariste Galois was a French mathematician who would be the ideal poster child for “taken too soon.” Fascinated by mathematics, by age 17 Galois had already published several notable papers, despite the fact that his university application had been rejected because his examiner couldn’t follow his train of thought.


Galois was not just passionate about mathematics, but also politics, and it was the latter that unfortunately cut his life short. He wound up in prison a number of times for his criticism of Louis Philippe, the last King of France, and eventually died at age 20 in a duel with an officer. The night before his death, he didn’t sleep a wink, instead staying up to write down as many of his as-yet unpublished mathematical ideas as possible. What a legend.


Amongst his many ideas were the outlines for Galois fields. In extremely non-mathematician terms, we can think about the need for Galois fields like this:


  1. There are infinitely many numbers.

  2. Working through complex mathematical proofs using infinitely many numbers takes a very long time (infinity).

  3. If we instead work within a finite set of numbers, then we can work through an idea in a finite amount of time.

  4. Number theorists like this, because it means they can usually get home in time for tea.


With that in mind, just know that a Galois field is finite, and extremely useful when working with polynomials. The count of items (numbers, polynomials or roots of polynomials) in a Galois field is always a prime, or a power of a prime (in which case it’s known as an extension field).


Essentially:

GF(5) = { 0, 1, 2, 3, 4 }

GF(11) = { 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 }

GF(n) = { 0, 1, 2, 3, …, n-1}


You can conduct any operation on values in a Galois field and get an answer which is also within the field; for a prime-sized field, you simply take the result modulo the size of the field. This lets us express any sequence of data as points on a polynomial graph, which is essential for the Reed-Solomon code.
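For example, here’s GF(11) arithmetic in a few lines of Python. Note that this simple ‘take the modulo’ trick only works when the field size is prime; the extension fields used in real storage systems, like GF(256), need polynomial arithmetic instead:

```python
p = 11                 # the size of our field, GF(11); must be prime for this trick

a, b = 7, 9
print((a + b) % p)     # 16 mod 11 -> 5: addition stays inside the field
print((a * b) % p)     # 63 mod 11 -> 8: so does multiplication
b_inv = pow(b, -1, p)  # modular inverse (Python 3.8+): 9 * 5 = 45 = 1 (mod 11)
print((a * b_inv) % p) # -> 2: even 'division' keeps us inside the field
```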


Why polynomials?

 Before we even go there, step back for a minute and imagine a standard cartesian graph, with an X and Y axis. Consider a single point on that graph, say at (2,2): 


How many lines can you draw through that single point?


As many as you like, of course! The options are infinite.


Now let’s step back, and add a second point to the graph. Now how many lines can be drawn that perfectly intersect these two points? 


Now, the answer is finite: only one line can connect these two points.


Error detection and cryptography have more in common than you might think: as well as underpinning the more complex ideas behind the Reed-Solomon algorithm, this concept is also a method you can use to send secret messages.


For example:


You and I both agree ahead of time that our ‘secret message’ is always going to have an X axis value of ‘0’. This is the equivalent of our secret key, allowing us to decrypt any new messages we send to each other. All I need to do from then onwards is give you two points on a graph (two ordered pairs).


Pssst… (2,2) and (-1,4)


By drawing a line between them and seeing where the line intersects with our secret key (X = 0), you can discover the real, secret number I want to send you:
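Here’s that recovery as a tiny Python sketch, using exact fractions (a real scheme would work inside a finite field so the secret stays a whole number):

```python
from fractions import Fraction

def secret_at_x0(point_a, point_b):
    """Draw the line through two points and read off its value at x = 0."""
    (x1, y1), (x2, y2) = point_a, point_b
    slope = Fraction(y2 - y1, x2 - x1)
    return y1 - slope * x1   # the y-intercept, i.e. our secret

print(secret_at_x0((2, 2), (-1, 4)))  # -> 10/3
```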


But of course, as well as sending secret messages, this technique can also be used to restore a number that was lost, by having an agreed point of reference. For scalable, clever error detection and correction, however, we need a bit more than a straight line:

The Reed-Solomon algorithm represents sequences of data as points on a polynomial curve. Once you’ve done that, you can evaluate further points on the same curve to use as parity data – the extra points become your code shards. The idea is that you define a code (like choosing our secret X axis value in the line example) which both sides can use to recover data. It’s highly space-efficient, though computationally expensive.


An example of encoding using the Reed-Solomon algorithm

Here, k is the number of data shards, and m is the number of code shards.


On the left you have a distribution matrix (also known as a generator matrix), which consists of an identity matrix in its top rows (one row per data shard), with m additional rows below it for the code shards.


The identity matrix essentially acts like a multiplier with a value of 1, so the top rows of the result are simply the data you are storing. The same multiplication with the coding rows produces the parity blocks.


Decoding using the Reed-Solomon algorithm

When we come to decode this message, we take the rows of the generator matrix corresponding to the shards we still have – a combination of data and parity blocks – invert that matrix, and multiply it by the surviving shards to recover the original data.
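Here’s a from-scratch sketch of both steps in Python. To keep it readable, it uses the prime field GF(257) and a plain (non-systematic) Vandermonde generator matrix, where any k rows are invertible, rather than the identity-on-top layout pictured above or the GF(2^8) arithmetic a real implementation like jerasure uses:

```python
P = 257  # prime field size; every shard value must be < 257

def vandermonde(rows, cols):
    return [[pow(r, c, P) for c in range(cols)] for r in range(rows)]

def mat_vec(m, v):
    return [sum(a * b for a, b in zip(row, v)) % P for row in m]

def invert(m):
    """Gauss-Jordan matrix inversion modulo P."""
    n = len(m)
    aug = [row[:] + [int(i == j) for j in range(n)] for i, row in enumerate(m)]
    for col in range(n):
        pivot = next(r for r in range(col, n) if aug[r][col])
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = pow(aug[col][col], -1, P)
        aug[col] = [x * scale % P for x in aug[col]]
        for r in range(n):
            if r != col and aug[r][col]:
                f = aug[r][col]
                aug[r] = [(x - f * y) % P for x, y in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]

k, m = 3, 2
G = vandermonde(k + m, k)  # 5x3 generator matrix
data = [42, 7, 99]         # our k original values
shards = mat_vec(G, data)  # 5 shards: any 3 of them can rebuild the data

kept = [0, 3, 4]           # pretend we lost shards 1 and 2
decoder = invert([G[i] for i in kept])
print(mat_vec(decoder, [shards[i] for i in kept]))  # -> [42, 7, 99]
```

The exact matrices differ between real plugins, but the shape is the same: encoding is a matrix multiplication, and decoding is ‘invert what survived, then multiply’.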

It’s this algorithm that allows you to still be able to play a mildly scratched DVD – the algorithm will detect and correct incorrect bits, so you get to watch the copy of Batman Forever you purchased in 1998, even after your Dad dropped his antique coin collection on top of it in 2003, before stashing it in a storage box for a decade or so… 


Which brings us to erasure coding

Erasure coding is what allows Ceph, the leading open source software-defined storage (SDS) platform, to protect your data without making complete copies of it, instead reconstructing it from fragments, which can be spread across multiple drives, racks, and data centres, giving you extremely high data durability while keeping your storage capacity requirements to a minimum.


Ceph allows for the creation of traditional replication pools (where you nominate the number of full duplicates of your data, which can be spread around for just as much data durability as erasure coding, but with the additional storage demands) or erasure coded pools. If you opt for an erasure coded pool, you’ll then need to choose an erasure coding algorithm. Ceph has a pluggable architecture for erasure codes, so you can specify which one to use on a pool-by-pool basis.


Be sure to choose one that best suits your workload before you run a production erasure coded pool: you can’t change the method for a pool once it has been created, as that would require re-encoding all of its data on the fly. If you really need to change it, you’ll have to create a new pool, then migrate your data over.


This example, taken from the Ceph documentation, where k=3 and m=2 (so, two parity shards for every three data shards), depicts the basics of how erasure coding works in Ceph:

Almost all of the erasure coding plugins within Ceph rely on some form of Reed-Solomon, and each variety has its own strengths and weaknesses. The main plugin is jerasure, which has both Vandermonde and Cauchy versions; the difference between these variants is essentially how they perform the matrix calculations.


The evolution isn’t over

Erasure coding has come a long way from the basic concept of single parity bits, but there’s still plenty of room for improvement. The Shingled Erasure Code (SHEC) method is a more recent proposal for a new form of erasure coding, intended to be more configurable than Reed-Solomon and very efficient when recovering data.


There’s also Clay, an erasure code implementation which focuses on reducing network bandwidth, and Locally Repairable Codes, which aim to reduce the impact of single-node failure events. Locally Repairable Codes add a third variable alongside k (data shards) and m (parity shards): l, for locality, offering an element of control over the locations chosen to store parity shards, which is useful when you want to recover large, geographically-distributed clusters. Depending on the cluster, it may be faster and cheaper to recover missing data shards locally than to fetch the surviving shards from the other side of the world. By distributing parity shards across very specific locations, error correction can be attempted over a local network when possible.


For now, the key thing to remember when working with erasure coding in Ceph is to choose values for k and m that reflect exactly how durable your cluster needs to be, versus how much you are willing to spend on redundant storage.

Did you try out the challenges in today’s blog post? You’ll find the answers to each question below!

 

Solution: Parity bit calculation

  1. 11110110

  2. 01111011

  3. 00001100


The first digit of each answer is the parity bit, which ensures an even number of ‘1’s in every string of data.


Solution: Credit card check digit calculation

Sadly, it’s just not possible to have the luckiest credit card number in the world, as you can see in the answer below: 




Solution: Transportation start-up logic puzzle

 So, who’s bringing back the dinosaurs? Answers below!



  • Divide By Her0 has a red logo, and is working with cars.

  • Aaaaa! has a yellow logo, and is working with horses.

  • S Power-21 has a green logo, and is working with bicycles.

  • Yuno Dynamics has a blue logo, and is reviving the dinosaurs!



Please note: some of the images used in this post were sourced, with permission, from a set of slides my colleague used in a FOSDEM presentation, where he covered the credit card example, Galois, the "Young Turks" and Reed-Solomon encoding. Other images were created by me, purchased from Adobe Stock Photos, or obtained under CC0. I originally published this post on the SoftIron blog, but that post is no longer live due to a strategic change in focus for the company, so I have republished it here as I'm very proud of the research and puzzle-setting I did for this piece!


