Plasma Key/Value Databases

PlasmaKV is a library for accessing key/value databases stored in PlasmaFS filesystems. The library is accompanied by a command-line utility, plasma_kv, which we use below to demonstrate what PlasmaKV can do (see Cmd_plasma_kv for documentation).

A database maps unique keys to values, i.e. it is just a simple index like NDBM. The keys have a limited size which must be specified when the database is created. The values can have any size.

Using plasma_kv

Example: Create a new database /demo with keys of up to 128 bytes in length:

$ plasma_kv create -db /demo -max-key-length 128

This actually created three files:

$ plasma ls /demo.*
-rw-r--r-- gerd gerd         0 2011-10-27 00:36 demo.data 
-rw-r--r-- gerd gerd         0 2011-10-27 00:27 demo.del  
-rw-r--r-- gerd gerd     65536 2011-10-27 00:36 demo.idx  

The file with suffix .data contains the key/value pairs in sequential order. The file with suffix .del accumulates the pairs that have been marked as deleted. The file with suffix .idx holds the index mapping keys to locations in the .data file. The index is a B+ tree, and thus implicitly sorts the keys lexicographically.
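To make the division of labour between the three files more concrete, here is a minimal in-memory sketch. It is not the real PlasmaKV file format or API: the type db and the functions create, insert, lookup, delete and keys are made up for illustration. A sorted Map stands in for the B+ tree, and for brevity the data buffer stores only the values rather than full key/value pairs.

```ocaml
(* Toy in-memory model of the three files; NOT the real PlasmaKV format. *)
module Index = Map.Make (String)  (* sorted map as a stand-in for the B+ tree *)

type db = {
  data : Buffer.t;                    (* role of .data: values appended in order *)
  mutable idx : (int * int) Index.t;  (* role of .idx: key -> (offset, length) *)
  mutable del : string list;          (* role of .del: keys marked as deleted *)
  max_key_length : int;               (* fixed when the database is created *)
}

let create ~max_key_length =
  { data = Buffer.create 64; idx = Index.empty; del = []; max_key_length }

(* Append the value to the data buffer and record its location in the index. *)
let insert db key value =
  if String.length key > db.max_key_length then invalid_arg "key too long";
  let offset = Buffer.length db.data in
  Buffer.add_string db.data value;
  db.idx <- Index.add key (offset, String.length value) db.idx

(* Find the location via the index, then read the bytes from the data buffer. *)
let lookup db key =
  match Index.find_opt key db.idx with
  | None -> None
  | Some (offset, len) -> Some (Buffer.sub db.data offset len)

(* Deleting only drops the index entry and notes the key; the data stays. *)
let delete db key =
  if Index.mem key db.idx then begin
    db.idx <- Index.remove key db.idx;
    db.del <- key :: db.del
  end

(* Keys come back in lexicographic order because the index is sorted. *)
let keys db = List.map fst (Index.bindings db.idx)
```

In this model a deleted value still occupies space in the data buffer. This hints at why a separate vacuum operation is useful.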

Insert an entry:

$ plasma_kv insert -db /demo my_key <my_value

Now the key "my_key" maps to a value which is taken from the contents of the file "my_value". The insert function can also work over lists of keys (options -files and -keys-and-files).

Look the entry up:

$ plasma_kv lookup -db /demo my_key

There is also a way to delete keys, and to vacuum the database.

The API

The OCaml API is provided by the module Pkv_api. See there for details.

Features

Now, why would you want to use PlasmaKV? It has a number of extremely interesting features:

There are also some points that have a good and a bad side:

Applications

The typical application

Implementation

As PlasmaFS supports transactional file operations, the implementation of the database library is stunningly simple, and the core of the library takes less than 2000 lines of code. From the programmer's point of view it looks as if we do not take concurrent accesses or any consistency or atomicity problems into account. We just append new entries to the .data file, and update the .idx file in place. The whole "trick" is to wrap these obvious operations in snapshot transactions. So, what's that?

PlasmaFS supports transactions like SQL, and this means one can run file operations (like lookup, read, write, rename) between the special start_transaction and commit_transaction directives to get transactional behavior. The effects of the transacted operations are hidden from other users until commit_transaction has run successfully. We do not want to go too deeply into the implementation of this, but this already makes clear why we can just modify the files without having to take other users into account. Only at commit_transaction time can the other users see the changes, and they see all changes at once (a commit is atomic).

Details of the transactional scheme

A particular detail of the machinery is quite interesting, though. When a data block of a file is modified, PlasmaFS does not modify the block in place (i.e. it does not change the block directly). The primary reason is that this would cause unmanageable consistency problems. Instead, PlasmaFS allocates a replacement block somewhere, and stores there a copy of the original block with the requested changes applied. When this is done inside a transaction, these replacement blocks just accumulate, and at commit time the block list is modified, and the so-far hidden replacement blocks become official. The original blocks remain available for the whole duration of the transaction, so when the file is read at the same time, the readers just access the original, unmodified blocks. To make the story complete, the original blocks are even specially protected after the commit as long as there are still readers. The readers keep their old view this way.
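The replacement-block idea can be sketched in a few lines. This is only a toy model under simplifying assumptions, not PlasmaFS code: the types store and txn and the functions start, read, write, commit and read_committed are hypothetical, blocks are plain strings, and a write supplies the complete new block contents.

```ocaml
(* Toy model of copy-on-write blocks; all names here are hypothetical. *)
type store = {
  mutable blocks : string array;   (* physical blocks, only ever appended to *)
  mutable block_list : int array;  (* committed file: logical -> physical block *)
}

(* pending holds (logical index, replacement physical block) pairs. *)
type txn = { store : store; mutable pending : (int * int) list }

(* Allocate a fresh physical block; existing blocks are never overwritten. *)
let allocate store contents =
  store.blocks <- Array.append store.blocks [| contents |];
  Array.length store.blocks - 1

let start store = { store; pending = [] }

(* Inside the transaction: own replacement blocks first, else committed state. *)
let read txn i =
  match List.assoc_opt i txn.pending with
  | Some phys -> txn.store.blocks.(phys)
  | None -> txn.store.blocks.(txn.store.block_list.(i))

(* A write goes to a replacement block that stays hidden until commit. *)
let write txn i contents =
  let phys = allocate txn.store contents in
  txn.pending <- (i, phys) :: List.remove_assoc i txn.pending

(* Commit installs a new block list in one step, so all changes appear at once. *)
let commit txn =
  let updated = Array.copy txn.store.block_list in
  List.iter (fun (i, phys) -> updated.(i) <- phys) txn.pending;
  txn.store.block_list <- updated;
  txn.pending <- []

(* What other users see: only the committed block list. *)
let read_committed store i = store.blocks.(store.block_list.(i))
```

Because commit never touches the old physical blocks, a reader that still holds old block numbers keeps seeing the old contents.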

Now, this does not yet completely explain how readers get a consistent view. The mechanism described so far works only for blocks the readers have already accessed: it is guaranteed that once a block is read, the readers see the same state of it for the rest of their transaction. However, the readers would see committed changes to blocks they have not yet accessed but will access later. This could cause inconsistencies. The solution is called "snapshot transaction" in PlasmaFS. This mode increases the isolation level, and the readers are guaranteed to see the state from the beginning of the transaction throughout the rest of it, even if writers commit in parallel. Snapshots are not implemented in the server, but just in the client: it is sufficient to download the complete block list to get the right level of protection from the server. Remember that the namenode does not know whether a block is actually accessed or not. It only knows which blocks it has made known to the accessing client. Because of this it is sufficient to download the block information for a file range to snapshot-protect this range. Here, this is just done for the complete block list, as we want to take a snapshot of the whole database.
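The difference between learning block numbers lazily and learning the complete block list up front can be illustrated with a small model. Again this is a sketch under assumptions, not the PlasmaFS client: the types namenode and reader and the functions open_lazy, open_snapshot, read and commit_write are invented for this example.

```ocaml
(* Toy model: why fetching the whole block list up front yields a snapshot.
   All names are hypothetical. *)
type namenode = {
  mutable blocks : string array;   (* physical blocks, never overwritten *)
  mutable block_list : int array;  (* currently committed block list *)
}

(* A reader remembers the block numbers the namenode has made known to it. *)
type reader = { nn : namenode; known : (int, int) Hashtbl.t }

(* Plain transaction: block numbers are learned lazily, on first access. *)
let open_lazy nn = { nn; known = Hashtbl.create 16 }

(* Snapshot transaction: learn the complete block list right at the start. *)
let open_snapshot nn =
  let r = open_lazy nn in
  Array.iteri (fun i phys -> Hashtbl.replace r.known i phys) nn.block_list;
  r

(* Use the remembered block number if there is one, else ask the namenode. *)
let read r i =
  let phys =
    match Hashtbl.find_opt r.known i with
    | Some p -> p
    | None ->
        let p = r.nn.block_list.(i) in
        Hashtbl.replace r.known i p;
        p
  in
  r.nn.blocks.(phys)

(* A writer commits a new version of logical block i. *)
let commit_write nn i contents =
  nn.blocks <- Array.append nn.blocks [| contents |];
  let updated = Array.copy nn.block_list in
  updated.(i) <- Array.length nn.blocks - 1;
  nn.block_list <- updated
```

If a lazy reader reads block 0, a writer then commits new versions of blocks 0 and 1, and the reader afterwards reads block 1, it ends up with an old block 0 next to a new block 1. A reader opened with open_snapshot before the commits sees the old version of both.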

All the complicated management of the transactions is implemented in the PlasmaFS namenode server. The PlasmaFS client just uses the provided file operations in a clever way to get the right isolation level. For the database code everything becomes very simple: by just enclosing the file operations in a snapshot transaction, all the fascinating concurrency, consistency, and atomicity properties are automatically ensured without additional coding.

Bypassing the namenode

In a larger PlasmaFS cluster, the namenode could become a bottleneck: all file transactions need the namenode at some point. For a high-performance database this is a problem, because the speed of the database should not be limited by an overloaded namenode.

At least for reading the database, the implemented transactional scheme has an interesting side effect. After the files have been opened, the namenode is only needed for loading the parts of the block list that are not yet known to the client. However, we already mentioned that we load the whole block list when the database is opened (for taking snapshots). The consequence is that the namenode is no longer accessed after that! All read accesses now only have to load data blocks, and can do this directly from the data nodes.

The library provides a function for checking whether the database has been updated since the snapshot was taken. This function, of course, needs the namenode, but it is really the only one, and you normally call it only once in a while.
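The access pattern described in this section can be made visible by counting namenode requests in a toy model. The names below (open_db, read, is_outdated, and the version counter) are assumptions made for illustration; they are not the functions of Pkv_api, and the real library may detect updates differently.

```ocaml
(* Toy model counting namenode requests; all names are hypothetical. *)
type namenode = {
  mutable block_list : int array;  (* committed block list *)
  mutable version : int;           (* assumed to be bumped by every commit *)
  mutable requests : int;          (* how often a client contacted the namenode *)
}

(* Physical blocks as served by the data nodes, without namenode involvement. *)
type datanodes = string array

type snapshot = { block_list_copy : int array; version_seen : int }

(* Opening the database: one namenode request that fetches the whole block list. *)
let open_db nn =
  nn.requests <- nn.requests + 1;
  { block_list_copy = Array.copy nn.block_list; version_seen = nn.version }

(* Reading uses only the local copy of the block list and the data nodes. *)
let read (dn : datanodes) snap i = dn.(snap.block_list_copy.(i))

(* The update check is the only later operation that needs the namenode. *)
let is_outdated nn snap =
  nn.requests <- nn.requests + 1;
  nn.version <> snap.version_seen
```

In this model any number of read calls leaves the request counter untouched. Only open_db and the occasional is_outdated call increase it.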