
Threshold Chunking

2021-02-05 · Programming · marco

The Problem

How do you distribute a file in pieces such that you need exactly a given number of the pieces to rebuild the entire file?

This article was inspired by the Shamir Secret Sharing Scheme (SSSS), a cryptographic scheme by which a secret key is split into $n$ fragments, any $m$ of which can be used to recreate the entire key ($n$ and $m$ being chosen at the beginning of the process). Recreating the key from fragments is automatic and requires no further password or other secret. If you can find $m$ fragments and know which index each of them has, you can recreate the key in its entirety.

The key property here is that you need exactly the threshold number of fragments, but that any set of that size will do. If you have fewer fragments than the threshold $m$, you are just as far from the key as if you had none. If you have more than the threshold, you gain no knowledge you wouldn’t already have with $m$ fragments. Which particular fragments are available is not relevant, only their total number.

The SSSS accomplishes this in a very elegant way. It uses a polynomial of order $m-1$ with random coefficients, except for the term of order 0 (the constant), which is set to the secret. For each index, the value of the polynomial is computed and handed out as a fragment, together with the index itself. We can generate as many values as we like, one per index. Mathematics tells us that any polynomial of order $m-1$ is defined by $m$ points, and each of our fragments is a pair of $(x,y)$ values. With $m$ of them, we can determine the polynomial, which means we get the secret back.
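
To make this concrete, here is a minimal Python sketch of share generation; the names `make_shares` and `PRIME` are illustrative, and a real implementation would choose the field to fit the secret:

    import secrets

    # Illustrative sketch only. All arithmetic happens modulo a prime.
    PRIME = 2**127 - 1  # a Mersenne prime comfortably larger than a small secret

    def make_shares(secret: int, n: int, m: int) -> list[tuple[int, int]]:
        """Generate n shares with threshold m: the constant term is the
        secret, the m-1 higher-order coefficients are random."""
        coeffs = [secret] + [secrets.randbelow(PRIME) for _ in range(m - 1)]
        shares = []
        for x in range(1, n + 1):
            y = 0
            for c in reversed(coeffs):  # evaluate the polynomial (Horner's method)
                y = (y * x + c) % PRIME
            shares.append((x, y))  # the fragment: index and polynomial value
        return shares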

Notice how with SSSS we can generate as many points as we like, but the polynomial (and hence the secret) is only determined when enough of the fragments are put together. It doesn’t matter which fragments we have, and having more than the threshold is of no utility.

SSSS is extremely efficient and elegant in this way. But is there a way to do something similar with an arbitrary file? Can we split a file into $n$ fragments such that any $m$ of them, but no fewer, can reconstitute the file?

Striping

My first approach was to simplify the problem: how do I split a file into $n$ fragments such that I need at most $m$ of them to reconstitute the file? I figured there is a simple solution if I divide the file into $n$ equal chunks and create the fragments out of those chunks in a convenient order.

Let’s start with nomenclature. From now on, “chunks” are consecutive sections of a file, while “fragments” are combinations of these chunks. Each chunk can be found in the original file at a specific offset (which may be unknown to us) and may not be unique. Chunks must not overlap, and no chunk can contain parts of a different chunk (although two chunks may hold identical data if the data is the same in the original file).

Let’s see what happens if we create fragments composed of consecutive chunks, modulo the number of chunks $n$. This type of fragment is called a “stripe.” A stripe is a longer section of a file than a chunk, but it may loop back to the beginning of the file if it starts towards the end. For instance, if we have 5 chunks C0 … C4, then the fragment composed of the chunks C1, C2, C3 is a stripe starting at C1 and of length 3. A stripe starting at C4 and of length 3 would have to loop back and be composed of C4, C0, C1. By their nature, stripes keep their initial chunk at the original file offset, and chunks from the beginning of the file are added at the end of the stripe.
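
In code, a stripe is just a window of chunk indices that wraps around; a tiny sketch (the function name is my own, illustrative choice):

    def stripe_chunks(start: int, length: int, n_chunks: int) -> list[int]:
        """Indices of the chunks in a stripe, wrapping modulo the chunk count."""
        return [(start + i) % n_chunks for i in range(length)]

    stripe_chunks(1, 3, 5)  # -> [1, 2, 3]: the stripe starting at C1
    stripe_chunks(4, 3, 5)  # -> [4, 0, 1]: the stripe starting at C4 wraps around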

Let’s assume we have 5 stripes of length 3. How many of them do we need in the worst case to recombine the file? That is, how many stripes are required such that each chunk is present in at least one stripe? Here is a table:

     S0  S1  S2  S3  S4
C0   x           x   x
C1   x   x           x
C2   x   x   x
C3       x   x   x
C4           x   x   x

Obviously, one stripe is insufficient. Two stripes may be sufficient: we see that S0 and S2 together have all five chunks. S0 and S1, though, are missing a chunk. Now we know that we may be able to reconstitute the file with two stripes, but we may need more.

Notice how, if we take S0 and S1, we are missing the last chunk. That one, though, is present in all other stripes. Which means that, if we choose S0 and S1, any of the other stripes will combine with the first two and yield the entire file.

More generally, we can see that each chunk, from C0 to C4, is missing in exactly two columns. That means that if we pick any three columns, every row is sure to have an x in at least one of them. In other words, we can rebuild the file from any 3 stripes.

How do we generalize this result? Let’s think of a different case, 7 stripes with 3 chunks per stripe:

     S0  S1  S2  S3  S4  S5  S6
C0   x                   x   x
C1   x   x                   x
C2   x   x   x
C3       x   x   x
C4           x   x   x
C5               x   x   x
C6                   x   x   x

You will notice that we need the first 5 stripes to get all the chunks together. At the same time, if we pick the first stripe and the fourth, any of the last three will yield the file. We can pick 4 stripes that do not complete the file, but we can pick 3 that do.

We are currently interested in the largest number of stripes required, which is the first number, 5. Notice that, by design, the stripe $S_x$ is made up of the chunks $C_x, C_{x+1}, C_{x+2}$ (indices modulo 7). The last chunk C6 appears for the first time in the sequence in the stripe S4. That’s because a stripe of length 3 ends with the chunk $3-1=2$ positions past its starting offset. More generally, if we have $n$ stripes and chunks and each stripe is made of $m$ consecutive chunks, we need $n-(m-1)$, or $n-m+1$, consecutive stripes to cover the entire file.

That’s good. But is it enough? Does any combination of $n-m+1$ stripes of length $m$ cover the entire file? The answer is, again, in the rows and columns. As we see in the table above, every chunk is present in $m$ columns. Because there are $n$ stripes (columns), every chunk is missing in exactly $n-m$ columns. That means that if we put together any $n-m+1$ stripes, we must have all chunks, since each chunk is missing from only $n-m$ stripes, one fewer than we picked.

If we have $m$ chunks per stripe and $n$ stripes, we need at most $n-m+1$ stripes to get all the chunks and the file back. If we reverse the logic, call the threshold $t$, and keep the other names the same, then $t=n-m+1$, or trivially $m=n-t+1$. In other words, if we set the number of stripes to $n$ and the threshold to $t$, stripes of length $m=n-t+1$ satisfy our requirement that any $t$ stripes will reconstitute the file.
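
This claim is easy to check by brute force; here is a small sketch that verifies it for the 7-stripe example above (names are illustrative):

    from itertools import combinations

    def stripes_cover(n: int, m: int, picked) -> bool:
        """True if the picked stripes together contain every one of the n chunks."""
        covered = {(s + i) % n for s in picked for i in range(m)}
        return len(covered) == n

    # With n = 7 stripes of length m = 3, any n-m+1 = 5 stripes cover the
    # file, while some sets of 4 (e.g. S0..S3) do not.
    n, m = 7, 3
    assert all(stripes_cover(n, m, p) for p in combinations(range(n), n - m + 1))
    assert not stripes_cover(n, m, [0, 1, 2, 3])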

Threshold Chunking

As mentioned, the algorithm above guarantees that at most the threshold of stripes is required to reconstitute the file, but we know that far fewer might suffice. With any stripe length, if we pick out of a run of consecutive stripes the ones that don’t overlap, we can cover the file with a potentially very small subset. With length 5, for instance, every chunk in S1, S2, S3 is already present in S0 or S4. That means that the set S0, S4 has the same information as the set S0, S1, S2, S3, S4.

Is there a different way to partition a file, such that we need at most the threshold of fragments to reconstitute the file, but also at least that number? The answer is, unsurprisingly, yes, and we will get there with the same table approach that we used for stripes.

First, we start with the row and column logic. We saw with the stripes that each row contained exactly $m$ letters x and was missing one in $n-m$ columns, which meant that any set of $n-m+1$ stripes contained each chunk somewhere. Now we want something similar, but different: we want any set of $t-1$ fragments to miss at least one chunk. How do we do that?

Before I reveal the answer, please notice that if every set of $t-1$ fragments already misses at least one chunk, then any smaller set also misses at least one chunk. That’s because every set of fewer than $t-1$ fragments is contained in some set of exactly $t-1$ fragments. We can get from the larger set to the smaller one by taking away fragments, and since taking away fragments cannot add chunks, the smaller set can’t contain all chunks either.

Now, let’s create the set of all combinations of size $t-1$ of our $n$ fragments. Say we have fragments $F_0$ to $F_{n-1}$; we will then have the combination $F_0, F_1, …, F_{t-2}$, and so on. Let’s divide the file into chunks, one per combination. Now let’s remove the chunk with index $a$ from all the fragments in the combination with index $a$. That is, the chunk will be missing in all the fragments included in that combination, but will be present in all other fragments.

Obviously, by design, any combination of fragments of size $t-1$ will miss one chunk (at least). This chunk, though, will be present in any of the other fragments, also by design. That is, we automatically have a partitioning of the file that satisfies our requirement.
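
Again, this is easy to verify exhaustively for a small case; a sketch for $n=5$, $t=3$ (the helper name is my own):

    from itertools import combinations

    n, t = 5, 3
    combos = list(combinations(range(n), t - 1))  # one chunk per combination

    def chunks_held(fragments):
        """Chunks present in at least one of the given fragments: chunk a is
        in fragment f exactly when f is not part of combination a."""
        return {a for a, combo in enumerate(combos)
                if any(f not in combo for f in fragments)}

    # every set of t-1 fragments misses at least one chunk...
    assert all(len(chunks_held(s)) < len(combos)
               for s in combinations(range(n), t - 1))
    # ...while any t fragments hold all of them
    assert all(len(chunks_held(s)) == len(combos)
               for s in combinations(range(n), t))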

To show it in a more graphical form, let’s look at our example above: 5 fragments with threshold 3. A set of 5 elements F0, …, F4 has exactly 10 combinations of $t-1=2$ elements. Let’s divide the file into 10 chunks, shown in the table below, and let each chunk correspond to one of the combinations in turn:

            F0  F1  F2  F3  F4
C0 – F0,F1          x   x   x
C1 – F0,F2      x       x   x
C2 – F0,F3      x   x       x
C3 – F0,F4      x   x   x
C4 – F1,F2  x           x   x
C5 – F1,F3  x       x       x
C6 – F1,F4  x       x   x
C7 – F2,F3  x   x           x
C8 – F2,F4  x   x       x
C9 – F3,F4  x   x   x

As you can see, each row is by design missing an x in exactly 2 columns, ensuring that every chunk is present somewhere in any set of three fragments. Just as important, any combination of two fragments is listed by design in the left-most column, corresponding to one chunk that is missing in both of those fragments. That means that no more and no fewer than 3 fragments are required to reconstitute the file.

Take for example the row marked C4. It corresponds to the combination F1, F2, which means we mark C4 present in all fragments except F1 and F2. By doing so, we know that if we only have fragments F1 and F2 available, they must miss this particular chunk C4. Since we did this for all combinations of fragments, we know that each combination is missing a different chunk.

Because of the constructive algorithm, we can also easily build the fragments as follows:

  1. Determine the list L of all combinations of $t-1$ elements of $n$ fragments (for instance, using itertools.combinations in Python)
  2. Divide the file into length(L) chunks of equal length
  3. Invert each combination to determine whether a chunk belongs in a fragment or not: a chunk goes into every fragment that is not named in its combination.
  4. Build the fragments from the inverted lists (see the sketch below)
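
A minimal Python sketch of these four steps; the function name and the exact chunk-size rounding are my own choices, not a fixed specification:

    import itertools

    def split_into_fragments(data: bytes, n: int, t: int) -> list[list[bytes]]:
        """Split data into n fragments so that any t of them, but no fewer,
        can reconstitute it."""
        # 1. All combinations of t-1 fragment indices, one chunk per combination.
        combos = list(itertools.combinations(range(n), t - 1))
        k = len(combos)
        # 2. Divide the file into k chunks of equal length; with ceiling
        #    division, the last chunk absorbs the remainder.
        size = -(-len(data) // k)
        chunks = [data[i * size:(i + 1) * size] for i in range(k)]
        # 3./4. Invert each combination: a chunk goes into every fragment
        #       that is NOT named in its combination.
        fragments = [[] for _ in range(n)]
        for chunk, combo in zip(chunks, combos):
            for f in range(n):
                if f not in combo:
                    fragments[f].append(chunk)
        return fragments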

Automatic Chunk Length

We need to store the length of each chunk somewhere, or we won’t be able to determine chunk boundaries when reassembling. If we restrict ourselves to the constructive algorithm above, though, we can determine the lengths algorithmically, without the need for extra storage.

First, we notice that we can in general divide a file into chunks of equal length, as long as the file is longer than the number of chunks. That is, a file of 6 bytes cannot be divided into 10 non-empty chunks; a file of 632 bytes can be divided into 10 chunks of whole bytes.

Second, we remark that one chunk may differ in length from the others. All chunks can be of the same length only if the length of the file is a multiple of the number of chunks. Let us resolve that we choose the chunk size such that the last chunk is the only one that can differ in length from the others.

Let us put together our fragments by combining the necessary chunks in the same order as they appear in the file. Fragment F2 above would be composed of the chunks C0, C2, C3, C5, C6, C9.

If we know the number of fragments and the threshold, we can compute the number of chunks, which in turn gives us the size of each chunk. Since the last chunk may differ in length, fragments come in two lengths: the fragments that contain the last chunk differ in length from the other ones.

Because we know the algorithms generating the number of chunks (from the number of combinations) and the size of each chunk (one $k$-th of the file, rounded, with $k$ the number of combinations), we can recover the original file size and chunk size from the known total number of fragments, threshold, and fragment index (all of which are required for SSSS to work as well). Knowing the chunk size, and knowing that the last chunk comes last in its fragment, we can regenerate the chunks in turn and from there recombine the file.
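
Under these conventions, the layout is fully determined by the parameters; a sketch of the computation, assuming ceiling division so that the last chunk is the short one:

    import math

    def chunk_layout(n: int, t: int, file_size: int) -> tuple[int, int]:
        """Number of chunks and the size of all chunks but the last."""
        k = math.comb(n, t - 1)    # one chunk per combination
        size = -(-file_size // k)  # ceiling division
        return k, size

    chunk_layout(5, 3, 632)  # -> (10, 64): nine chunks of 64 bytes, one of 56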

Number of Fragments vs. Threshold

If you recall, the combination formula is symmetric: the number of combinations of $k$ elements out of $n$ is the same as the number of combinations of $n-k$ elements. In our case, that means that the number of chunks is the same for small thresholds and for large ones, and that, for a given number of fragments, the number of chunks is largest when the threshold sits in the middle.
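
A quick worked example for $n = 5$ fragments:

    import math

    n = 5
    print([math.comb(n, t - 1) for t in range(1, n + 1)])
    # -> [1, 5, 10, 10, 5]: the chunk count is symmetric in the threshold
    #    and peaks when t - 1 is near n/2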

Edge Cases

This algorithm works very reliably in the intended case: distributing a file of medium size into fragments, such that a subset of them has to be recombined to get the original file back. The same algorithm doesn’t work as well when the number of fragments or the threshold is very small or very large, or when the file to be distributed is very large or very small. Let us consider these edge cases in turn.

1. Small Number of Fragments

In the case of a single fragment, we are not splitting the file at all and instead keeping it whole. This is a possible case, of course, albeit a pretty boring one. It should be special-cased, because there is no need for computation of chunks and combinations.

The case of two or more fragments is described in the general case.

2. Large Number of Fragments / Small File Size

If the number of fragments is large enough, the chunk size can get smaller than a single unit (a byte), in which case chunks of equal length cannot be formed. Because of the way combinations are formed, the total number of chunks is maximized when the threshold is about half the number of fragments. This is particularly relevant when the file to be fragmented is small.
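
The blow-up is dramatic; a quick check of one (illustrative) parameter choice:

    import math

    # With n = 20 fragments and threshold t = 11, the construction already
    # needs comb(20, 10) chunks; a 100 kB file cannot give each chunk a byte.
    print(math.comb(20, 10))  # -> 184756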

It is advantageous to reduce the total number of chunks in this case. A solution, albeit fraught with the problems mentioned above, is to use the striping scheme. In it, chunks are much larger, there being only $n$ chunks in a file of $n$ fragments.

3. Large File Size

Instead of splitting the original file into a given number of chunks, each of them potentially large and of arbitrary size, it might make more sense to split the file along convenient boundaries (say, into 1 kB chunks) and modify the algorithm above to assign each chunk to fragments according to its index modulo the number of combinations.

Let us consider the case of 5 fragments with threshold 3. As we saw, this leads to a “chunking constant” of 10. Instead of splitting a 1 GB file into 10 chunks of 100 MB each (the last one slightly smaller), we could split the file into 1000 chunks of 1 MB each. If the first fragment contains the first chunk, it would also contain the 11th, 21st, 31st, etc. chunks. This is particularly useful for memory management purposes, since the chunks are kept in memory, but the fragment is written to disk.
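
A sketch of the modified assignment, with chunks distributed round-robin over the combinations (the function name is my own):

    from itertools import combinations

    def fragment_large_file(chunks: list[bytes], n: int, t: int) -> list[list[bytes]]:
        """Assign chunk i according to the membership pattern of combination
        i modulo k, where k is the chunking constant comb(n, t-1)."""
        combos = list(combinations(range(n), t - 1))
        fragments = [[] for _ in range(n)]
        for i, chunk in enumerate(chunks):
            combo = combos[i % len(combos)]
            for f in range(n):
                if f not in combo:  # same inversion rule as before
                    fragments[f].append(chunk)
        return fragments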

Since this changes the algorithm used to generate the chunks, joining the fragments will also have to take this into account.

Final Comments

While inspired by the Shamir Secret Sharing Scheme (SSSS), which also splits data according to a set number of shares with a threshold requirement to reconstitute, Threshold Chunking (TC) is entirely different in its implementation. In SSSS, a property of polynomials is used to hide the secret. Each share handed out is entirely useless, until the threshold is reached. When the threshold is reached, additional shares do not add information.

In TC, the data stored in the fragments is part of the shared data. While the original data can only be completely reconstituted once the threshold of fragments is reached, the data itself is stored plainly. As such, TC by itself is not as useful as it is in combination with SSSS. Combining TC and SSSS yields the benefits of both: TC allows for the storage of an unlimited amount of data in fragments, where SSSS can only store a limited amount of information in the secret; SSSS, in turn, provides a limited secret stored in fragments that are entirely useless until a sufficient number of them is retrieved.

Together, SSSS and TC allow for the storage of unlimited amounts of data in fragments, such that no information about the data aside from the total size can be inferred from the fragments until a threshold number of them is reached, in which case all the original data becomes available.