Thursday, May 19, 2005

WODFS storage architecture

An off-the-cuff description of the WOD's storage system

OK, I admit it. We shamelessly ripped off the idea for how to break up and store data from the Venti filesystem papers.

So the basic idea is that each file is broken up into a number of blocks. Each block is 1024 bytes in size (for now, more later on variable-sized blocks) and its fingerprint is the SHA-1 hash of the data in the block.

A given block can be either a data block or a pointer block. Pointer blocks have pointers to other pointer blocks or to data blocks. Each file is conceptually a tree of pointer blocks that end up pointing to leaf data blocks. Because has as its fingerprint the hash of the data in the block, each block can be interpreted as either a data block or a pointer block. This is not an issue to begin with but as the system fills with data, we project that this will result in considerable savings in terms of space since the data and the data structure are expressed in the same way.

In addition, each file is uniquely identified by a fingerprint which is the hash of its contents: the leaf data blocks are referenced by the pointer blocks all the way up to a root block for the file; that block's fingerprint is the file's fingerprint. Similarly, the entire filesystem is represented by a single root fingerprint. This root fingerprint is stored between sessions and is, in fact, the only item necessary to rebuild the entire system from scratch.

For the moment the type of a block is decided by a flag in the pointer block that points to it. This is a bit fragile since there is a limited amount of space for flags and thus the scheme could become unwieldy at some point in the future.

Since blocks are context-neutral they can be shared by different instances of the WOD running on the same machine.

The blocks are stored on the disk contiguously without any specific order (this sucks because retrieval is via random-access only). To speed up block retrieval, an index maps fingerprints to offsets in the data "file". This index can be rebuilt from the data "file" at any time and so it is not crucial to the functioning of the system. Theoretically the index could be stored within the system itself and loaded after the system is initialized. In this way the bootstrap routines would be a bit slower but once the index was loaded the whole system would speed up.

Right now the big hit for performance is in what we call the chainer. This part of the system takes a contiguous stream of bytes and breaks it up into a tree of blocks. It would be really nice to figure out a way to derive the "chained fingerprint" (i.e. the root fingerprint for the processed file) from the initial data in order to be able to skip the chaining phase when it is not required. Already the system discards write hits for blocks that are pre-extant in the data "file"; it would be nice to bring that up a level so that entire files could be discarded if their contents were known (this could be trivially done by adding the original hash to the root block since the original hash of all files with the same root block would be identical).

No comments: