Post by Willem Jan Withagen
Is there any expectation that this is going to be fixed in the near
future?
Post by Alan Somers
No. It's fundamentally impossible to support posix_fallocate on a COW
filesystem like ZFS. Ceph should be taught to ignore an EINVAL result,
since the system call is merely advisory.
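
For illustration, here is a minimal sketch of the "ignore EINVAL" handling suggested above, written in Python rather than in Ceph's own code. The helper name preallocate_best_effort is hypothetical, and treating EOPNOTSUPP the same way as EINVAL is an assumption, not something stated in this thread. Since EINVAL can also mean a bad offset or length, the sketch validates the arguments first.

```python
import errno
import os


def preallocate_best_effort(fd: int, offset: int, length: int) -> bool:
    """Try to reserve space for fd; return False if the filesystem declines.

    Hypothetical helper: the name and the return convention are
    illustrative only.
    """
    # Reject bad arguments up front, so that an EINVAL coming back from
    # the OS most likely means "not supported on this filesystem".
    if offset < 0 or length <= 0:
        raise ValueError("offset must be >= 0 and length must be > 0")
    try:
        os.posix_fallocate(fd, offset, length)
        return True
    except OSError as exc:
        # EINVAL is the result discussed in this thread. EOPNOTSUPP is
        # included as an assumption about other filesystems.
        if exc.errno in (errno.EINVAL, errno.EOPNOTSUPP):
            return False
        raise  # ENOSPC, EBADF and similar are still real failures
```

A caller that only wants a hint can carry on writing when this returns False, and should then be prepared for ENOSPC on a later write.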
Post by Garrett Wollman
I don't think it's true that this is _fundamentally_ impossible. What
the standard requires would in essence be a per-object refreservation.
ZFS supports refreservation, obviously, but not on a per-object basis.
Furthermore, there are mechanisms to preallocate blocks for things
like dumps. So it *could* be done (as in, the concept is there), but
it may not be practical. (And ultimately, there are ways in which the
administrator might manage the system that would defeat the desired
effect, but that's out of the standard's scope.) Given the semantic
mismatch, though, I suspect it's unreasonable to expect anyone to
prioritize implementation of such a feature.
I don't think posix_fallocate() can be compatible with COW. Suppose you
do reserve a fixed set of blocks. That ensures the first write has a
place to write, but not if you overwrite one of those blocks. You'd have
to allocate a new block each time you overwrote a block, or you'd have to
have a way to mark a file as not COW. The
first case isn't really any better than not using posix_fallocate() in the
first place as you are still requiring writes to allocate blocks, and the
second seems a bit fraught with peril as well if the application is
expecting the non-COW'd file to be in sync with other files in the system
since presumably non-COW'd files couldn't be snapshotted, etc.
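
The overwrite problem described above can be shown with a toy model. This is only a sketch: ToyCowPool and ToyCowFile are made-up names, the pool merely counts free blocks, and nothing here reflects how ZFS or any real filesystem does its accounting.

```python
import errno


class ToyCowPool:
    """Toy pool that only counts free blocks (not a real filesystem)."""

    def __init__(self, total_blocks: int) -> None:
        self.free = total_blocks

    def alloc(self) -> None:
        if self.free == 0:
            raise OSError(errno.ENOSPC, "toy pool is full")
        self.free -= 1

    def release(self) -> None:
        self.free += 1


class ToyCowFile:
    """File whose writes never overwrite in place (copy-on-write)."""

    def __init__(self, pool: ToyCowPool, nblocks: int) -> None:
        self.pool = pool
        self.reserved = 0                  # blocks set aside by fallocate()
        self.has_data = [False] * nblocks  # logical blocks written so far

    def fallocate(self) -> None:
        # Naive scheme: reserve exactly one block per logical block.
        for _ in self.has_data:
            self.pool.alloc()
            self.reserved += 1

    def write(self, index: int) -> None:
        # The new copy needs a block before the old copy can be dropped.
        if self.reserved > 0:
            self.reserved -= 1
        else:
            self.pool.alloc()              # may fail with ENOSPC
        if self.has_data[index]:
            self.pool.release()            # old copy is freed after the write
        self.has_data[index] = True
```

With a 4-block pool and a 4-block file, fallocate() followed by one write to each block succeeds, but a second write to block 0 raises ENOSPC: the reservation is used up and the old copy is not freed until the new one has been placed.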
I think Garrett's assessment that it is not fundamentally impossible, but
may not be felt to be worth implementing in any given file system for
practical reasons, is correct. I say this having designed/implemented a
COW file system that was driven by customer pressure to do things that at
first pass one might declare represented an architectural contradiction,
but upon further reflection were entirely possible to do given sufficient
willingness to invest the effort and accept the accompanying trade-offs,
additional knobs to turn, etc.
In this case (posix_fallocate() + COW + snapshots), it could be implemented
with a per-object allocator that normally keeps at least one extra block
beyond the reservation requirement on hand, plus a snapshot operation that
in order to succeed has to be able to provision the local allocators of all
fallocated objects with enough additional blocks to maintain the no-fail
write guarantee post-snapshot.
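
A rough sketch of that scheme, as a toy model only. ToyPool and ReservedObject are hypothetical names, blocks are merely counted, and the blocks retained by snapshots are not tracked, so this is one reading of the idea and not the design of any actual file system.

```python
import errno


class ToyPool:
    """Shared pool that only counts free blocks."""

    def __init__(self, total_blocks: int) -> None:
        self.free = total_blocks
        self.objects = []                  # fallocated objects in this pool

    def take(self, n: int) -> None:
        if n > self.free:
            raise OSError(errno.ENOSPC, "toy pool cannot supply blocks")
        self.free -= n

    def snapshot(self) -> None:
        # Every block written since the last snapshot gets pinned, so its
        # next overwrite can no longer recycle the old copy and needs one
        # more block. Provision all of them up front or fail the snapshot.
        need = sum(obj.fresh_count() for obj in self.objects)
        self.take(need)                    # raises before any state changes
        for obj in self.objects:
            obj.pin_for_snapshot()


class ReservedObject:
    """Object with a local allocator holding its reservation plus one spare."""

    def __init__(self, pool: ToyPool, nblocks: int) -> None:
        pool.take(nblocks + 1)             # the reservation plus one extra block
        self.spare = nblocks + 1
        # fresh[i] is True when block i was written since the last snapshot,
        # which means an overwrite can recycle the old copy.
        self.fresh = [False] * nblocks
        pool.objects.append(self)

    def fresh_count(self) -> int:
        return sum(self.fresh)

    def pin_for_snapshot(self) -> None:
        self.spare += self.fresh_count()   # blocks already taken by the pool
        self.fresh = [False] * len(self.fresh)

    def write(self, index: int) -> None:
        # Invariant: spare == (number of non-fresh blocks) + 1, so there is
        # always a block available for the new copy.
        assert self.spare >= 1
        self.spare -= 1
        if self.fresh[index]:
            self.spare += 1                # old copy returns to the allocator
        self.fresh[index] = True
```

In this model a write to a fallocated object never touches the shared pool, so it cannot fail for lack of space. The cost moves to snapshot(), which raises ENOSPC when the pool cannot top up every local allocator.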
-Patrick