File system crash consistency and performance
POSIX guarantees a certain amount of consistency of file operations
during normal operation: E.g., consider an editor editing file F;
after a while, the user saves the file, and the editor implements this
request by writing the new contents to a new file, say #F#, then
renaming #F# into F. Another process opening F will either get the
old file, or the new file.
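The save-by-rename pattern can be sketched as follows. This is a minimal illustration, not the code of any particular editor; the function name save_atomically and the choice of putting #F# in the same directory as F are assumptions. It only covers normal operation and says nothing yet about crashes.

```python
import os
import tempfile


def save_atomically(path: str, data: bytes) -> None:
    """Replace the file at `path` with `data` by writing #F#, then renaming."""
    directory, name = os.path.split(os.path.abspath(path))
    tmp_path = os.path.join(directory, "#" + name + "#")  # the "#F#" of the text
    with open(tmp_path, "wb") as tmp:
        tmp.write(data)
    # On POSIX systems os.replace() is rename(): another process opening F
    # gets either the old file or the new file, never a mixture.
    os.replace(tmp_path, path)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        f = os.path.join(d, "F")
        save_atomically(f, b"old contents")
        save_atomically(f, b"new contents")
        with open(f, "rb") as fh:
            print(fh.read())
```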
POSIX does not make any useful guarantees in case of a crash (i.e.,
OS crash, power outage or the like), but what would we like file
systems to do? A naive approach would be to perform all writes to
persistent memory synchronously (i.e., wait for them to complete), and
perform operations such as renames in a way that we see no
inconsistent state. In case of a crash we would see the same file
system state that was current (and visible to other processes) right
before the crash.
While this approach may perform ok for editing, workloads that write a
lot of data (e.g., untarring a file) would be slowed down a lot by
this approach. So file systems usually compromise on consistency in
favour of efficiency. How should they do that? There are at least
two positions:
- Implementation-oriented
- Following this position, the application should tell the file
system through the fsync() system call when it wants a specific
file to be persistent. This position is taken by, e.g., Ted
Ts'o, implementor of the ext4 file system. In our example, this
would mean an fsync() of #F# right after writing its contents,
maybe an fsync of the directory containing #F# after that, and
an fsync of the directory containing F after the rename. If the
application does enough fsyncs (and it's not obvious how many are
enough; I am not aware of a guide for achieving consistency
using fsync), the data will be consistent; if not, there may be
data loss, and the file system developer will blame the
application and deny any responsibility.
- Semantics-oriented
- The file system guarantees that the persistent state
represents one of the states visible in normal operation before
the crash; i.e., the user may lose some updates that other
processes could see before the crash, but if the application
ensured consistency in normal operation, the persistent state
will be a consistent state. In our example F would contain
either the old or the new state, no fsyncs required. However,
you probably want to sync before reporting the completion of a
transaction (say, confirm a purchase) to a remote user. My
position is that file systems should give this guarantee.
Unfortunately, last I looked among the Linux file systems, only
NILFS2 gave this consistency guarantee. The advocates of the
implementation-oriented approach denigrate this guarantee
because they want users to think that it is unrealistic.
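The fsync sequence described for the implementation-oriented position can be sketched like this. It is a sketch under assumptions, not a recipe that is known to be sufficient: the helper names fsync_directory and save_with_fsyncs are made up, #F# and F are assumed to live in the same directory, and calling fsync on a directory file descriptor is assumed to work as it does on Linux.

```python
import os
import tempfile


def fsync_directory(directory: str) -> None:
    """fsync a directory so that entries created or renamed in it persist."""
    fd = os.open(directory, os.O_RDONLY)
    try:
        os.fsync(fd)
    finally:
        os.close(fd)


def save_with_fsyncs(path: str, data: bytes) -> None:
    directory, name = os.path.split(os.path.abspath(path))
    tmp_path = os.path.join(directory, "#" + name + "#")
    with open(tmp_path, "wb") as tmp:
        tmp.write(data)
        tmp.flush()             # move Python's user-space buffer to the kernel
        os.fsync(tmp.fileno())  # fsync of #F# right after writing its contents
    fsync_directory(directory)  # the "maybe" step: directory containing #F#
    os.replace(tmp_path, path)  # rename #F# into F
    fsync_directory(directory)  # directory containing F, after the rename


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        save_with_fsyncs(os.path.join(d, "F"), b"new contents")
        print(os.listdir(d))
```

Every one of these fsync calls waits for the storage device. On a file system that follows the semantics-oriented position, the plain write followed by the rename would already give the old-or-new guarantee after a crash.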
One advantage of the semantics-oriented approach is that an
application that is debugged for consistency in case of ordinary
operation is automatically consistent (albeit not necessarily
up-to-date) in case of a crash.
A frequent argument for the implementation-oriented approach is that
it provides better performance. It may appear that way if you
benchmark an application with a given number of fsyncs, but that
results in different crash consistency guarantees.
In particular, if the application does not perform any fsyncs, the
implementation-oriented file system does not guarantee any crash
consistency, while the semantics-oriented file system still gives its
usual consistency guarantee.
If the application performs the fsyncs necessary for consistency on
the implementation-oriented file system, the semantics-oriented file
system guarantees that the persistent state represents the logical
state of the file system at some point in time at or after the
fsync, while the implementation-oriented file system does not give
such a guarantee.
But what if we compare variants of the application tuned for the
requirements of the specific file system? I.e., no fsyncs or syncs
for a semantics-oriented file system if we can live with losing the
progress of a certain number of seconds (e.g., when using an editor
with an autosave file, getting the old file and the autosave file
may be almost as good as getting an up-to-date file), but the full
complement of fsyncs for an implementation-oriented file system (an
application could have a flag for enabling or disabling these
fsyncs)? I expect the semantics-oriented file system to come out
ahead in such a comparison.
However, unlike running a given benchmark, my expectations are hard to
confirm or disprove, even for specific cases, much more so in the
general case: You never know if an application has enough fsyncs for
implementation-oriented file systems (it also depends on the use
case of the application), unless you throw in an fsync after every
change to a file or directory; and in the latter case the performance
is going to suffer a lot and the advocates of
implementation-oriented file systems are going to complain that you
used too many fsyncs.
- Implementation-oriented file systems
- The application is going to perform a lot of fsyncs, at least
as many as necessary for the consistency level that some user of
the application may require, so possibly many more than needed
for the user at hand. And every fsync is going to wait until
the data hits persistent memory and reports back success. This
may be quite slow.
- Semantics-oriented file systems
- For now, let's assume that the application does not need
synchronicity, so it does not perform fsyncs. In that case, the
file system can perform as few synchronous block device flushes
(or asynchronous barriers) as desired for limiting the maximum time
lost (e.g., one synchronous flush every 5s to guarantee that not
more than 10s of work are lost): First request writes of nearly
all the data and metadata to free blocks, then flush, then
request a write of the commit block that makes that data and
metadata visible to the file system; the commit block will become
persistent with the next flush at the latest, and will then make
all the data persistently visible to the file system. This assumes that all
writes (except maybe the commit block) go to free blocks, as
happens with a copy-on-write file system, a log-structured file
system, or a sufficiently journaling file system; there is some
cost associated with this, but I expect that it is relatively small.
Even if you need to sync (e.g., before you confirm a purchase
to a remote customer), this will be much rarer and cost less than
the many fsyncs needed for satisfying implementation-oriented file
systems. And it's clear when it is necessary to perform these
syncs, because they come from application needs.
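To make the flush-then-commit sequence of the semantics-oriented file system more concrete, here is a toy simulation. It models no real file system or device; ToyDisk, ToyFS, recover and run_scenario are hypothetical names, blocks are never reused (so every write goes to a free block), and a crash is modelled by letting an arbitrary chosen subset of the not-yet-flushed writes survive.

```python
class ToyDisk:
    """Block device with a volatile write cache (hypothetical model)."""

    def __init__(self):
        self.persistent = {}  # block number -> contents that survive a crash
        self.cache = {}       # requested writes that may not be persistent yet

    def write(self, block, contents):
        self.cache[block] = contents  # asynchronous write request

    def flush(self):
        # Synchronous flush: every write requested so far becomes persistent.
        self.persistent.update(self.cache)
        self.cache.clear()

    def crash(self, survivors=()):
        # Power loss: only the cached writes listed in `survivors` made it.
        for block in survivors:
            if block in self.cache:
                self.persistent[block] = self.cache[block]
        self.cache.clear()


class ToyFS:
    COMMIT = 0  # fixed location of the commit block

    def __init__(self, disk):
        self.disk = disk
        self.next_free = 1  # blocks are never reused in this sketch
        self.files = {}     # state visible to processes: name -> block
        disk.write(self.COMMIT, {})
        disk.flush()

    def write_file(self, name, data):
        block = self.next_free  # always write to a free block
        self.next_free += 1
        self.disk.write(block, data)
        self.files[name] = block

    def checkpoint(self):
        # Called periodically (e.g., every 5s); the application never fsyncs.
        self.disk.flush()                               # data is now persistent
        self.disk.write(self.COMMIT, dict(self.files))  # request commit block


def recover(disk):
    """Read back the file system state that the persistent commit block names."""
    table = disk.persistent[ToyFS.COMMIT]
    return {name: disk.persistent[block] for name, block in table.items()}


def run_scenario(survivors=()):
    disk = ToyDisk()
    fs = ToyFS(disk)
    fs.write_file("F", "old")
    fs.checkpoint()
    fs.write_file("F", "new")
    fs.checkpoint()
    fs.write_file("F", "newest")
    disk.crash(survivors)
    return recover(disk)


print(run_scenario())               # nothing in the cache survives
print(run_scenario(survivors={0}))  # the pending commit block happened to land
```

In this run the recovered F is "old" in the worst case and "new" if the pending commit block reached the medium; it is never a partially written or mixed state. The worst case also shows where the factor of two between flush interval and maximum loss comes from: "new" was written before the second checkpoint, but its commit block only becomes persistent for certain with the third flush.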
Given these issues with implementation-oriented file systems, do
you really want to use one if you care for crash consistency?
In any case, keep in mind that running a given benchmark on the
two kinds of file systems usually does not give the same crash
consistency guarantees, and therefore the performance numbers may be
misleading.