File system crash consistency and performance

Crash consistency

POSIX guarantees a certain amount of consistency for file operations during normal operation: E.g., consider an editor editing file F; after a while the user saves the file, and the editor implements this request by writing the new contents to a new file, say #F#, and then renaming #F# to F. Another process opening F will get either the old file or the new file.
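
For concreteness, here is a minimal sketch of this save-by-rename pattern in C (error handling simplified, file names as in the example):

  #include <fcntl.h>
  #include <stdio.h>
  #include <unistd.h>

  /* save buf to F such that readers see either the old or the new contents */
  int save(const char *buf, size_t len)
  {
    int fd = open("#F#", O_WRONLY|O_CREAT|O_TRUNC, 0666);
    if (fd < 0)
      return -1;
    if (write(fd, buf, len) != (ssize_t)len || close(fd) != 0)
      return -1;
    return rename("#F#", "F"); /* atomically replaces F */
  }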

POSIX does not make any useful guarantees in case of a crash (i.e., an OS crash, a power outage, or the like), but what would we like file systems to do? A naive approach would be to perform all writes to persistent memory synchronously (i.e., wait for them to complete), and to perform operations such as renames in a way that never exposes an inconsistent state. In case of a crash we would then see the same file system state that was current (and visible to other processes) right before the crash.

While this approach may perform acceptably for editing, it would slow down workloads that write a lot of data (e.g., untarring an archive) considerably. So file systems usually compromise on consistency in favour of efficiency. How should they do that? There are at least two positions:

Implementation-oriented
Following this position, the application should tell the file system through the fsync() system call when it wants a specific file to be persistent. This position is taken by, e.g., Ted Ts'o, implementor of the ext4 file system. In our example, this would mean an fsync() of #F# right after writing its contents, maybe an fsync() of the directory containing #F# after that, and an fsync() of the directory containing F after the rename (see the sketch after these two positions). If the application does enough fsyncs (and it's not obvious how much is enough; I am not aware of a guide to achieving consistency using fsync), the data will be consistent; if not, there may be data loss, and the file system developer will blame the application and deny any responsibility.
Semantics-oriented
The file system guarantees that the persistent state represents one of the states visible in normal operation before the crash; i.e., the user may lose some updates that other processes could see before the crash, but if the application ensured consistency in normal operation, the persistent state will be a consistent state. In our example F would contain either the old or the new contents, no fsyncs required. However, you probably want to sync before reporting the completion of a transaction (say, confirming a purchase) to a remote user. My position is that file systems should give this guarantee. Unfortunately, the last time I looked, among the Linux file systems only NILFS2 gave this consistency guarantee. The advocates of the implementation-oriented approach denigrate this guarantee as O_PONIES, because they want users to think that it is unrealistic.
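
Under the implementation-oriented position, the save() sketched above grows roughly as follows (a sketch; I assume #F# and F live in the same directory ".", and whether this is the full set of required fsyncs is exactly the unclear part):

  #include <fcntl.h>
  #include <stdio.h>
  #include <unistd.h>

  int save_fsync(const char *buf, size_t len)
  {
    int fd = open("#F#", O_WRONLY|O_CREAT|O_TRUNC, 0666);
    if (fd < 0)
      return -1;
    if (write(fd, buf, len) != (ssize_t)len)
      return -1;
    if (fsync(fd) != 0 || close(fd) != 0)  /* persist the contents of #F# */
      return -1;
    int dirfd = open(".", O_RDONLY|O_DIRECTORY);
    if (dirfd < 0)
      return -1;
    if (fsync(dirfd) != 0)                 /* persist the directory entry #F# */
      return -1;
    if (rename("#F#", "F") != 0)
      return -1;
    if (fsync(dirfd) != 0)                 /* persist the rename */
      return -1;
    return close(dirfd);
  }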

One advantage of the semantics-oriented approach is that an application that is debugged for consistency in ordinary operation is automatically consistent (albeit not necessarily up-to-date) in case of a crash.

Performance

A frequent argument for the implementation-oriented approach is that it provides better performance. It may appear that way if you benchmark an application with a given number of fsyncs, but then the two file systems provide different crash consistency guarantees.

In particular, if the application does not perform any fsyncs, the implementation-oriented file system does not guarantee any crash consistency, while the semantics-oriented file system performs as consistently as ever.

If the application performs the fsyncs necessary for consistency on the implementation-oriented file system, the semantics-oriented file system guarantees that the persistent state represents the logical state of the file system at some point in time at or after the fsync, while the implementation-oriented file system does not give such a guarantee.

But what if we compare variants of the application tuned for the requirements of the specific file system? I.e., no fsyncs or syncs for a semantics-oriented file system if we can live with losing the last few seconds of work (e.g., when using an editor with an autosave file, getting the old file plus the autosave file may be almost as good as getting an up-to-date file), but the full complement of fsyncs for an implementation-oriented file system (an application could have a flag for enabling or disabling these fsyncs, as sketched below).
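
Such a flag could be as simple as the following wrapper (names are made up):

  #include <unistd.h>

  int fsync_for_consistency = 1; /* clear (e.g., via a command-line option)
                                    on a semantics-oriented file system */

  int maybe_fsync(int fd)
  {
    return fsync_for_consistency ? fsync(fd) : 0;
  }

The application would then call maybe_fsync() wherever the implementation-oriented position demands an fsync().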

Implementation-oriented file system
The application is going to perform a lot of fsyncs, at least as many as necessary for the consistency level that some user of the application may require, so possibly many more than needed for the user at hand. And every fsync is going to wait until the data hits persistent memory and the hardware reports success. This may be quite slow.
Semantics-oriented file system
For now, let's assume that the application does not need synchronicity, so it does not perform fsyncs. In that case, the file system can perform as few synchronous block device flushes (or, theoretically, asynchronous barriers) as desired to bound the maximum time lost (e.g., one synchronous flush every 5s to guarantee that no more than 10s of work are lost; data written just after one flush is committed with the next flush and thus persistent up to two flush periods later): first request writes of nearly all the data and metadata to free blocks, then flush, then request a write of the commit block that makes this data and metadata visible to the file system; the commit block becomes persistent with the next flush at the latest, and with it all the data and metadata it commits. This assumes that all writes (except maybe the commit block) go to free blocks, as happens in a copy-on-write file system, a log-structured file system, or a file system that journals data as well as metadata; there is some cost associated with this, but I expect it to be relatively moderate.
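
To illustrate the ordering (this is not how you write a real file system; devfd, the block numbers, and the layout are made up, and fsync() on the block device file descriptor stands in for a device cache flush):

  #include <sys/types.h>
  #include <unistd.h>

  /* one commit cycle; error handling omitted; bs is the block size */
  void commit_cycle(int devfd, size_t bs,
                    const void *blocks, off_t blockno, size_t nblocks,
                    const void *commit, off_t commitno)
  {
    /* 1. write new data and metadata to free blocks; the old file
          system state stays intact, since no live block is overwritten */
    pwrite(devfd, blocks, nblocks * bs, blockno * bs);
    /* 2. flush, so everything written so far is persistent before the
          commit block can land */
    fsync(devfd);
    /* 3. write the commit block; once persistent (with the next flush
          at the latest), it makes the new state the current one */
    pwrite(devfd, commit, bs, commitno * bs);
  }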

Even if you need to sync (e.g., before you confirm a purchase to a remote customer), such syncs will be much rarer and cost less than the many fsyncs needed for satisfying implementation-oriented file systems. And it is clear when these syncs are necessary, because they come from application needs.
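
E.g. (a sketch with made-up names; orderfd refers to the file with the order record, sockfd to the connection to the customer):

  #include <unistd.h>

  void confirm_purchase(int orderfd, int sockfd)
  {
    fsync(orderfd);  /* needed even on a semantics-oriented file system:
                        the confirmation must not outrun persistence */
    write(sockfd, "confirmed\n", 10);
  }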

However, unlike running a given benchmark, my expectations are hard to confirm or disprove, even for specific cases, and much more so in the general case: you never know whether an application performs enough fsyncs for implementation-oriented file systems (it also depends on the usage of the application), unless you throw in an fsync after every change to a file or directory; and in that case the performance is going to suffer a lot, and the advocates of implementation-oriented file systems are going to complain that you used too many fsyncs.

Given these issues with implementation-oriented file systems, do you really want to use one if you care about crash consistency?

In any case, keep in mind that running a given benchmark on the two kinds of file systems usually does not result in the same crash consistency guarantees, and therefore the performance numbers may be misleading.


Anton Ertl