Holey Copy-on-Write

Documentation

Assumptions

The base assumption is a shared-nothing replicated application running on a shared-storage cluster as follows:
Each node in the cluster keeps a full copy of the entire storage. It is up to the application to ensure that, when one of the copies is modified, the write operations are propagated to the other replicas. Read-only operations can be executed at a single replica, chosen according to an application-specific load-balancing policy. The resulting consistency criterion is therefore defined by these replication and load-balancing mechanisms. The transformation to a shared-storage cluster also assumes that all replicas can safely read a common storage area. This is trivially satisfied if the area is a block device or, if the shared area is a file, by using a shared file system such as OCFS2.

Problem

Notice that a shared-storage version cannot be easily obtained by starting from the architecture described above. If all replicas are naively configured with the same storage volume, data corruption ensues. This happens because, even if the replication mechanism is logically a replicated state machine, and hence deterministic, the mapping from the logical to the physical layer is not. For instance, if the application is a database and different replicas deterministically add the same tuple to a table, that tuple is likely to end up in different disk blocks on each replica because of internal concurrency.
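The following toy sketch illustrates the point. It is not taken from any real database: `apply_inserts` is a hypothetical allocator that hands out blocks in whatever order concurrent threads happen to run, and the thread order is passed in explicitly to keep the example deterministic.

```python
def apply_inserts(tuples, thread_order):
    """Place each tuple in the next free block, in thread scheduling order."""
    blocks = []
    for i in thread_order:
        blocks.append(tuples[i])   # next free block goes to whichever thread ran
    return blocks


tuples = ["alice", "bob"]                  # same logical updates on both replicas
replica_a = apply_inserts(tuples, [0, 1])  # thread 0 ran first
replica_b = apply_inserts(tuples, [1, 0])  # thread 1 ran first

same_logical_state = set(replica_a) == set(replica_b)
same_physical_layout = replica_a == replica_b
```

Both replicas hold the same table contents, yet block 0 differs between them, so neither could safely write its own layout over a volume the other is reading.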

Normal Operation

The proposed HoleyCoW architecture is as follows:

It takes advantage of all replicas being able to read the shared volume, although only one of the replicas (the writer) is allowed to write back to it. The other replicas (copiers) each keep a copy-on-write overlay. All read, write and sync requests of the application are intercepted by the HoleyCoW layer to enforce these rules.

In detail, it works as follows:
When the writer replica issues a write request (1), the request is blocked and all replicas are notified (2) to fetch the previous value of the data into their local snapshot (3). Once all replicas have confirmed (4), i.e. the block is stable, the write request can proceed (5). Notice that this needs to be done only once for each storage block. Write requests from copier replicas are routed directly to the snapshot copy without further interaction.
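The numbered steps can be sketched as follows. This is a minimal in-memory model under stated assumptions, not the HoleyCoW implementation: the shared volume is a Python dict, notification is a direct method call, and the names `Copier`, `Writer`, `preserve` and `stable` are hypothetical.

```python
class Copier:
    """Copier replica with a sparse copy-on-write overlay (hypothetical model)."""

    def __init__(self, shared):
        self.shared = shared   # shared volume, modelled as {block: bytes}
        self.overlay = {}      # local snapshot, holds only the blocks copied so far

    def preserve(self, block):
        # Step (3): fetch the previous value, unless a local write already owns it.
        self.overlay.setdefault(block, self.shared[block])
        return True            # step (4): report back to the writer

    def write(self, block, data):
        # Copier writes go straight to the overlay, with no further interaction.
        self.overlay[block] = data


class Writer:
    """The single replica allowed to write back to the shared volume."""

    def __init__(self, shared, copiers):
        self.shared = shared
        self.copiers = copiers
        self.stable = set()    # blocks every copier has already preserved

    def write(self, block, data):
        # Step (1): the request is held here until the block is stable.
        if block not in self.stable:
            # Step (2): notify every copier and wait for all of them.
            if not all(c.preserve(block) for c in self.copiers):
                raise RuntimeError(f"block {block} is not stable")
            self.stable.add(block)   # needed only once per block
        self.shared[block] = data    # step (5): the write proceeds
```

Because `stable` remembers blocks that have already been preserved, a second write to the same block goes straight to the shared volume, which is what keeps the coordination cost to one round per block.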

Read requests at copiers are served first from the local snapshot and, if the block is not there because it has not yet been copied or written locally, from the backing storage. The writer always reads from the backing storage.
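A minimal sketch of the two read paths, assuming the snapshot and the backing storage are both modelled as dicts keyed by block number (`copier_read` and `writer_read` are hypothetical names):

```python
def copier_read(block, overlay, shared):
    """Serve a read at a copier: local snapshot first, backing storage otherwise."""
    if block in overlay:
        return overlay[block]   # block was preserved or written locally
    return shared[block]        # hole in the overlay: fall through to shared volume


def writer_read(block, shared):
    """The writer keeps no overlay, so it always reads the backing storage."""
    return shared[block]


shared = {0: b"current", 1: b"untouched"}
overlay = {0: b"snapshot"}      # only block 0 has been copied so far

first = copier_read(0, overlay, shared)    # comes from the overlay
second = copier_read(1, overlay, shared)   # comes from the shared volume
```

The overlay stays sparse ("holey"): only blocks that were preserved or written locally take up space on the copier.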

When the writer replica issues a sync request, to flush operating system caches to disk, HoleyCoW starts by ensuring that all previously written blocks are stable, and only then issues the operating system sync operation.
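One way to picture this ordering is a gate that waits on one future per written block before flushing. This is a sketch under assumptions: `SyncGate` and `track` are hypothetical names, stability is modelled with `concurrent.futures.Future`, and the real flush (for example `os.sync`) is injected as a callable.

```python
from concurrent.futures import Future, wait


class SyncGate:
    """Delay the OS-level sync until every written block is stable."""

    def __init__(self, os_sync):
        self.os_sync = os_sync   # callable doing the real flush, e.g. os.sync
        self.pending = []        # one future per block awaiting stability

    def track(self, stable):
        # 'stable' completes once all copiers have preserved the block.
        self.pending.append(stable)

    def sync(self, timeout=None):
        done, not_done = wait(self.pending, timeout=timeout)
        if not_done:
            raise TimeoutError(f"{len(not_done)} block(s) not yet stable")
        for f in done:
            f.result()           # re-raise any failure reported for a block
        self.pending.clear()
        self.os_sync()           # only now flush caches to disk


flushed = []
gate = SyncGate(os_sync=lambda: flushed.append(True))
block_stable = Future()
gate.track(block_stable)
block_stable.set_result(True)    # all copiers have preserved the block
gate.sync(timeout=1.0)
```

If any block is still unstable when the timeout expires, the sketch raises instead of flushing, so nothing reaches the disk before the copiers hold the old values.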

Garbage Collection

The procedure described in the previous section is not, however, sufficient to ensure that the space used is bounded. By itself, it would lead to the overlays growing into full copies of the storage volume, thus falling back to the shared-nothing scenario.

Fortunately, garbage collection of a copy can be done at any time, provided that the corresponding application instance has been stopped. Therefore, one has to periodically stop and restart each of the replicas to make their corresponding snapshots vanish. This has a small impact on system performance as long as the cluster is sufficiently large and traffic can be transparently rerouted to the remaining replicas.

Such periodic restarts have been suggested as appropriate even in a pure shared-nothing scenario, for dependability purposes, where they are known as proactive recovery.
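A rolling restart of this kind might look like the sketch below. The `Replica` class, the `rolling_gc` function and the `min_running` threshold are hypothetical, and the sketch leaves out how a restarted copier re-registers with the writer.

```python
class Replica:
    """A copier replica and its overlay (hypothetical model)."""

    def __init__(self, name):
        self.name = name
        self.running = True
        self.overlay = {}        # block -> data held in the local snapshot


def rolling_gc(copiers, min_running=1):
    """Restart copiers one at a time, discarding each overlay while stopped."""
    for replica in copiers:
        others = sum(1 for c in copiers if c is not replica and c.running)
        if others < min_running:
            raise RuntimeError("not enough replicas left to absorb the traffic")
        replica.running = False  # stop the application instance
        replica.overlay.clear()  # the snapshot vanishes
        replica.running = True   # restart: reads now fall through to shared volume


copiers = [Replica("a"), Replica("b"), Replica("c")]
copiers[0].overlay = {1: b"x", 2: b"y"}
rolling_gc(copiers)
```

Restarting one replica at a time keeps at least `min_running` others available, which is the condition under which the text expects the performance impact to stay small.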

Failure and Recovery

First, one has to consider a failed copier replica. Because the writer waits for every copier before a block becomes stable, such a failure will eventually block writes to disk and thus has to be dealt with. Fortunately, it is easily solved by removing the replica from the writer's configuration and fencing it from the shared storage.
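The effect on blocked writes can be sketched as below. `WriterConfig`, `ack` and `remove_failed` are hypothetical names, and fencing is an injected callable because the actual mechanism depends on the storage in use. Fencing before removal is an assumption of this sketch; it prevents a copier that is slow rather than dead from reading blocks the writer has since overwritten.

```python
class WriterConfig:
    """The writer's view of which copiers must confirm each block."""

    def __init__(self, copiers):
        self.copiers = set(copiers)
        self.acks = {}           # block -> copiers that have preserved it

    def ack(self, block, copier):
        self.acks.setdefault(block, set()).add(copier)

    def is_stable(self, block):
        # Stable once every configured copier has confirmed the block.
        return self.copiers <= self.acks.get(block, set())

    def remove_failed(self, copier, fence):
        fence(copier)                 # cut it off from the shared storage first
        self.copiers.discard(copier)  # then stop waiting for it


fenced = []
config = WriterConfig(["a", "b"])
config.ack(7, "a")                    # "b" has failed and never answers
blocked = not config.is_stable(7)
config.remove_failed("b", fence=fenced.append)
unblocked = config.is_stable(7)
```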

The failure of the writer has a much smaller immediate impact, as the system will not block. The system will, however, diverge increasingly, because a restarted replica can no longer obtain a current version of the shared storage. The recovery mechanism is also simple: the old writer is first fenced from the shared volume, and a new replica then takes over as the writer by dumping its local copy back to the shared volume.
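A minimal sketch of the takeover, assuming dict-based storage as a stand-in for the real volume. `promote_to_writer` is a hypothetical name, fencing is an injected callable, and coordination with the remaining copiers during the dump is deliberately left out.

```python
def promote_to_writer(overlay, shared, old_writer, fence):
    """Fence the failed writer, then dump the new writer's overlay to the volume."""
    fence(old_writer)                  # the old writer must not touch the volume again
    for block in sorted(overlay):
        shared[block] = overlay[block] # write the local copy back, block by block
    overlay.clear()                    # the new writer reads the backing storage from now on


fenced = []
shared = {0: b"stale", 1: b"same"}
overlay = {0: b"current"}              # state held by the replica being promoted
promote_to_writer(overlay, shared, old_writer="w0", fence=fenced.append)
```

After the call the promoted replica has no overlay left, matching the rule that the writer always reads from the backing storage.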

More...

More information can be found on the technical reports page.
