Tahoe-LAFS Tutorial (Part 1) | Decentralized Cloud Storage
Category: cloud
07 Apr 2017
Modified Image By Jason Baker
I like cryptography, not because I know the ins and outs of how it works, or because I like arguing on /r/crypto about using AES in CBC, ECB, ABC, or 123. I get the gist of the overall mechanics of it, but to me the allure is simply in its ability to keep secrets from prying eyes, no matter how strong they are. Crypto levels the playing field.
It’s this same fascination that brought my attention to Tahoe-LAFS, a fairly complicated, slightly arcane file system that can theoretically allow you to back up your data and distribute it across servers owned by the CIA, NSA, and GCHQ, without any of them being able to peek in. What’s more is that if one of them took their server down, you would still be able to piece together your data from the other two.
Unfortunately, the CIA, NSA, and GCHQ don’t offer Tahoe-LAFS storage (bummer, right?), but this mental exercise, albeit ridiculous, showcases just how powerful well-designed software can be. And indeed, Tahoe-LAFS is powerful, well-designed software.
This tutorial is the first of three parts. This part focuses on explaining what Tahoe-LAFS is and the broad mechanics of how it works as simply as possible, without losing too much important detail. Indeed, while you could likely get the gist of Tahoe-LAFS across in a few sentences, to fully appreciate it, never mind deploy it, you need to sink your teeth into it to a degree.
The second part will teach you how to actually set up Tahoe-LAFS and anonymously back up files to a decentralized storage network on I2P. The third and final part will go one step further, and will explain how you can help others back up their data with Tahoe-LAFS by setting up a Tahoe-LAFS storage server on I2P.
Tahoe-LAFS: The Overview
To understand Tahoe-LAFS, it is helpful to understand that LAFS stands for Least Authority File System. This refers to the Principle of Least Authority, which essentially means that a user or a program should only have access to the bare minimum of what it justifiably needs to do its job. For example, an Android app that only functions as a calculator would be breaking the principle of least authority if it were to demand permission to access your microphone and GPS, two sensors that it clearly has no need to access in order to calculate numbers.
Tahoe-LAFS applies this principle to managing data across multiple computers (think ‘the cloud’) by designing a system that utilizes cryptography such that data is only readable by the user who uploads it. This seems logical, but Dropbox, Google, OneDrive, and almost every other cloud provider break this principle by designing their systems such that they can read every byte of their users’ data.
With that said, Tahoe-LAFS is not a cloud provider. It is a generic program that anyone can run to distribute files securely across multiple computers. Therefore, if you want to use Tahoe-LAFS to back up your data, you will need to find a place to store it.
You can do this by either setting up a Tahoe-LAFS server somewhere yourself, or finding a grid run by others. A grid is essentially just a network of Tahoe-LAFS servers that each offer a certain amount of storage space. Several exist, including the $25 per month S4 grid run by the Tahoe-LAFS developers, as well as the free grid over on I2P run by volunteers (which we will return to in the next part of the series).
Because Tahoe-LAFS encrypts everything before uploading to a grid, you do not need to trust grid operators whatsoever. All they do is provide storage on their hard drives. They cannot read what you’re storing, and they cannot modify what you’ve stored without your client noticing, no matter how hard they try.
Furthermore, if, for example, some of those volunteers on the I2P grid decided to stop and shut down their servers that happened to house your data, depending on how you configured Tahoe-LAFS on your own computer when you uploaded your files, there’s a good chance your data would still be accessible. This is because Tahoe-LAFS utilizes what is known as erasure encoding.
Erasure encoding lets you, for instance, turn a 1GB file into three blocks, each about half the size of the original (i.e. roughly 500MB). At first, this seems like a problem, because it means that a 1GB file requires ~1.5GB of storage. However, erasure encoding in this way means that you could lose any one of those blocks and still be able to reassemble the whole file from the remaining two.
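To make the "lose any one block" idea concrete, here is a minimal sketch using XOR parity. It is not the algorithm Tahoe-LAFS uses (the real thing relies on a more general erasure code), and the function names `split_with_parity` and `rebuild` are made up for illustration. In this sketch each block comes out at half the length of the input, and any two of the three are enough to get the input back.

```python
def split_with_parity(data: bytes) -> list[bytes]:
    """Return three blocks: first half, second half, and their XOR."""
    if len(data) % 2:
        data += b"\x00"  # pad to an even length; the caller keeps the true length
    half = len(data) // 2
    first, second = data[:half], data[half:]
    parity = bytes(x ^ y for x, y in zip(first, second))
    return [first, second, parity]


def rebuild(blocks: dict[int, bytes], length: int) -> bytes:
    """Rebuild the original from any two blocks, keyed by position 0, 1, 2."""
    if len(blocks) < 2:
        raise ValueError("need at least two of the three blocks")

    def xor(p: bytes, q: bytes) -> bytes:
        return bytes(x ^ y for x, y in zip(p, q))

    if 0 in blocks and 1 in blocks:
        first, second = blocks[0], blocks[1]
    elif 0 in blocks:
        first = blocks[0]
        second = xor(first, blocks[2])  # second half = first XOR parity
    else:
        second = blocks[1]
        first = xor(second, blocks[2])  # first half = second XOR parity
    return (first + second)[:length]    # drop any padding byte
```

Calling `rebuild({0: blocks[0], 2: blocks[2]}, len(data))` after throwing away the middle block returns the original bytes, which is the whole trick: the parity block carries just enough information to stand in for whichever block went missing.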
Tahoe-LAFS calls these blocks shares, and you can fully customize how you use shares. What I presented is a 2-of-3 scheme, meaning that you need 2 shares out of 3 shares to restore the file. By default, Tahoe-LAFS uses a 3-of-10 scheme, meaning that your file is broken up into 10 shares, and you only need 3 of those shares to restore your file. Of course, the more shares you add, the more data you have to send whenever you upload those files (a 3-of-10 scheme stores roughly 3.3 times the original file size), so it is a trade-off.
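The trade-off is easy to put into numbers. The sketch below is just arithmetic, not anything from Tahoe-LAFS itself: `share_plan` is a hypothetical helper, and it ignores the small amount of extra space taken up by hashes and other bookkeeping. It assumes each share is the file size divided by the number of shares needed, rounded up.

```python
def share_plan(file_size: int, needed: int, total: int) -> dict:
    """Estimate storage cost of a needed-of-total scheme (sizes in bytes)."""
    if not 1 <= needed <= total:
        raise ValueError("need 1 <= needed <= total")
    share_size = -(-file_size // needed)  # ceiling division
    return {
        "share_size": share_size,           # bytes per share
        "stored": share_size * total,       # bytes across the whole grid
        "expansion": total / needed,        # stored size relative to the file
        "tolerated_losses": total - needed, # shares you can lose and still recover
    }


two_of_three = share_plan(1_000_000_000, needed=2, total=3)
three_of_ten = share_plan(1_000_000_000, needed=3, total=10)
```

With a 1GB file, the 2-of-3 plan stores about 1.5GB and survives one lost share, while the 3-of-10 plan stores about 3.3GB and survives seven.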
In short, Tahoe-LAFS provides a way to store your files securely on other people’s computers, without having to trust them not to look at or modify your files, and without needing them all to always be online.
How It Works: The Basic Mechanics of Tahoe-LAFS
To illustrate how Tahoe-LAFS works, let’s run through what happens when you upload a file onto a grid. This will be a simplified explanation, so if you want the juicy technical details head on over to the official documentation.
The first thing that happens when you upload a file is that it is fully encrypted (using AES, if you care).
Next, the encrypted file is broken up into shares (erasure encoded) based on your configuration (again, the default is 3-of-10). Hashes, which let you check that files haven’t been tampered with, are then created for the encrypted file and for each share. These hashes eventually get stored alongside each share on the grid.
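If hashes are new to you, the following sketch shows the basic idea with plain SHA-256 from Python's standard library. Treat it as a simplification: the real system uses a more elaborate hash scheme, and the helper names `hash_shares` and `verify_share` are my own.

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 digest of data as a hex string."""
    return hashlib.sha256(data).hexdigest()


def hash_shares(ciphertext: bytes, shares: list[bytes]) -> dict:
    """Hash the encrypted file and every share made from it."""
    return {
        "ciphertext": sha256_hex(ciphertext),
        "shares": [sha256_hex(share) for share in shares],
    }


def verify_share(share: bytes, expected_hash: str) -> bool:
    """True only if the share still matches the hash recorded at upload."""
    return sha256_hex(share) == expected_hash


# Toy data standing in for an encrypted file split into two shares.
hashes = hash_shares(b"aabb", [b"aa", b"bb"])
untouched = verify_share(b"aa", hashes["shares"][0])
tampered = verify_share(b"ab", hashes["shares"][0])
```

Here `untouched` is true and `tampered` is false: changing even one byte of a share produces a completely different hash, so a server cannot quietly alter what it stores.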
A capability is then produced based on the file. A capability is essentially just a string of text that includes (1) the decryption key, (2) information to help reassemble the different shares, (3) a permission (read and write, for example), and (4) a hash of the hashes stored on the grid to make 100% sure nothing has been tampered with. Put simply, these four pieces let you use one simple string of text to both locate and access your data on the grid, and make sure that nobody has tampered with it. Think of it as a fancy link to your data.
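To show how four pieces can fit into one string, here is a toy capability builder. The `DEMO-CAP` format, the `DemoCap` class, and the way `root_of` combines hashes are all invented for this post and are not the format Tahoe-LAFS actually uses; the point is only that a key, reassembly parameters, a permission, and a hash of hashes can travel together as plain text.

```python
import base64
import hashlib
from dataclasses import dataclass


def _b32(raw: bytes) -> str:
    """Lowercase base32 without padding, to keep the string compact."""
    return base64.b32encode(raw).decode("ascii").rstrip("=").lower()


def root_of(share_hashes: list[bytes]) -> bytes:
    """A 'hash of the hashes': changes if any single share hash changes."""
    return hashlib.sha256(b"".join(share_hashes)).digest()


@dataclass(frozen=True)
class DemoCap:
    permission: str   # e.g. "ro" for read-only (hypothetical label)
    key: bytes        # decryption key
    root_hash: bytes  # hash of the share hashes
    needed: int       # shares required to rebuild the file
    total: int        # shares created at upload

    def to_string(self) -> str:
        return ":".join([
            "DEMO-CAP", self.permission, _b32(self.key),
            _b32(self.root_hash), str(self.needed), str(self.total),
        ])


cap = DemoCap("ro", b"\x00" * 16, root_of([b"hash-1", b"hash-2"]), 3, 10)
cap_string = cap.to_string()
```

Anyone holding `cap_string` has everything needed to fetch, check, and decrypt the file, which is why a capability should be guarded like a password.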
With the data ready to begin uploading, your Tahoe-LAFS client then asks a special server, known as an introducer, where other servers offering storage are, and how much storage they’re willing to provide. In other words, the introducer introduces you (get it?) to servers willing to store your data. Then, depending on how many shares you set to create, your client will pick that many servers (if enough are available) and begin uploading shares to them, while attempting to make sure that only 1 share is put onto each server for maximum redundancy.
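The "one share per server where possible" goal can be sketched with a simple round-robin. This is an illustration only: the real client has its own server-selection logic, and `place_shares` and the server names are hypothetical.

```python
def place_shares(total_shares: int, servers: list[str]) -> dict[str, list[int]]:
    """Spread share numbers over servers as evenly as possible."""
    if not servers:
        raise ValueError("the introducer returned no storage servers")
    placement: dict[str, list[int]] = {server: [] for server in servers}
    for share_num in range(total_shares):
        # Walk the server list in order, wrapping around only when there
        # are more shares than servers.
        server = servers[share_num % len(servers)]
        placement[server].append(share_num)
    return placement


plenty = place_shares(10, [f"server-{i}" for i in range(10)])
scarce = place_shares(10, ["server-a", "server-b", "server-c", "server-d"])
```

With ten servers every server gets exactly one share. With only four, some servers end up holding three shares, so losing one of those servers costs you three shares at once.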
After the data is uploaded, the Tahoe-LAFS client will do a ‘health check’. Part of the share configuration includes setting how many separate servers you’re happy with having your data stored on. For example, in a 3-of-10 scheme, you might be happy with those 10 shares ending up on 7 different servers, so your ‘servers-of-happiness’ (literally what the setting is called), would be set as 7. Alternatively, if you really want maximum redundancy, you might set it to 10.
In any case, during the health check your client will make sure that your shares are spread across as many servers as your ‘servers-of-happiness’ setting, and the upload will not be considered successful unless that threshold is met. When it is met, the upload will be considered healthy and complete. At this point, the job is done and your data is safe and sound… almost.
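In its simplest form, the check boils down to counting how many different servers ended up holding at least one share. The sketch below does exactly that and nothing more; the real check is more careful about which shares sit where, and `is_happy` is a name I made up.

```python
def is_happy(placement: dict[str, list[int]], happy: int) -> bool:
    """True if at least `happy` distinct servers hold one or more shares."""
    servers_with_shares = sum(1 for shares in placement.values() if shares)
    return servers_with_shares >= happy


# Ten shares that landed on seven servers (three servers hold two each).
placement = {
    "s1": [0, 7], "s2": [1, 8], "s3": [2, 9],
    "s4": [3], "s5": [4], "s6": [5], "s7": [6],
}
ok_with_seven = is_happy(placement, happy=7)
ok_with_ten = is_happy(placement, happy=10)
```

The same upload passes with a threshold of 7 and fails with a threshold of 10, which is why a stricter setting only makes sense on a grid with enough servers to satisfy it.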
Consider this scenario: you generously set up a Tahoe-LAFS storage server, and over time people upload several hundred gigabytes of data to it. Eventually, this data becomes too much to store, and much of it is several years old, possibly abandoned.
It is because of scenarios like this that Tahoe-LAFS offers a tool for storage server operators to delete data that has a lease older than a certain age. A lease is just a stamp that the user puts on their shares that effectively says “please don’t delete this file before [insert date here]”. If the user doesn’t renew their lease (put a new stamp on their file) before the lease expires, or if the storage server enforces a maximum lease duration (for example, the storage server may not respect leases longer than 3 months) and the lease isn’t renewed within that duration, then the data will be deleted.
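The rule described above can be written down in a few lines. This is a sketch of the logic as explained in this post, not of the actual storage server code; `lease_expired` and its parameters are hypothetical.

```python
from datetime import datetime, timedelta
from typing import Optional


def lease_expired(
    last_renewed: datetime,
    lease_duration: timedelta,
    now: datetime,
    server_max: Optional[timedelta] = None,
) -> bool:
    """True if a share is eligible for deletion at time `now`."""
    # A server that caps lease length honours whichever duration is shorter.
    effective = lease_duration
    if server_max is not None:
        effective = min(lease_duration, server_max)
    return now >= last_renewed + effective


renewed = datetime(2017, 1, 1)
today = datetime(2017, 6, 1)
one_year = timedelta(days=365)

safe = lease_expired(renewed, one_year, today)
at_risk = lease_expired(renewed, one_year, today, server_max=timedelta(days=90))
```

The same one-year lease is still good on a server with no cap (`safe` is false) but has already lapsed on a server that only respects 90 days (`at_risk` is true).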
This means that every so often, users need to renew their leases, or risk having their data deleted. Fortunately, there’s a tool for this, and it’s easily scripted so that it can be done automatically according to a schedule. The important part is simply to remember to do this!
Finally, we get to file repair. Eventually, something will happen that results in one of your shares being deleted or taken offline. So, instead of having 10 shares across 10 servers, you’ll have 9, and then 8, and eventually you’ll drop below your ‘servers-of-happiness’ threshold. When this happens, you will need to repair your shares by re-uploading some onto new servers. Like renewing leases, there are easily scriptable tools to validate and repair your shares on the grid, and you can dive into those in Part 2!
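Here is a rough sketch of the decision a repair script has to make as servers disappear. It is a simplification built from the description in this post: `file_status`, the status labels, and the server names are all made up, and the real tools do considerably more.

```python
def file_status(
    shares_by_server: dict[str, set[int]],
    online: set[str],
    needed: int,
    happy: int,
) -> str:
    """Classify a file as 'healthy', 'needs-repair', or 'unrecoverable'."""
    reachable = [
        server for server, shares in shares_by_server.items()
        if server in online and shares
    ]
    # Distinct share numbers we can still fetch from reachable servers.
    distinct = set().union(*(shares_by_server[s] for s in reachable))
    if len(distinct) < needed:
        return "unrecoverable"  # too few shares left to rebuild the file
    if len(reachable) < happy:
        return "needs-repair"   # still readable, but redundancy has eroded
    return "healthy"


# A 3-of-10 upload with one share on each of ten servers.
grid = {f"server-{i}": {i} for i in range(10)}
all_up = set(grid)
six_up = {f"server-{i}" for i in range(6)}
two_up = {"server-0", "server-1"}

statuses = [file_status(grid, up, needed=3, happy=7) for up in (all_up, six_up, two_up)]
```

The three cases come out as healthy, needs-repair, and unrecoverable. The middle state is the one you want to catch: the file can still be rebuilt, so repairing now costs little, whereas waiting risks sliding into the third.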
For now, I’ll stop here. This is everything you need to know before setting up Tahoe-LAFS. Check out part 2 to learn how to actually use it. This was a fairly complicated tutorial compared to usual, so feel free to ask questions or provide feedback in the comments section down below!