Discussion:
posix_fallocate on ZFS
Willem Jan Withagen
2018-02-10 17:28:15 UTC
Hi,

This has been disabled on ZFS since last November.
And I do understand the rationale on this.

BUT

I've now upgraded some of my HEAD Ceph test systems and they now fail,
since Ceph uses posix_fallocate() to allocate space for the
FileStore-journal.

Is there any expectation that this is going to be fixed in the near future?

--WjW
Alan Somers
2018-02-10 18:24:31 UTC
No. It's fundamentally impossible to support posix_fallocate on a COW
filesystem like ZFS. Ceph should be taught to ignore an EINVAL result,
since the system call is merely advisory.

-Alan
Willem Jan Withagen
2018-02-10 18:50:28 UTC
Yup, that is what I'm going to do.
But then I would like to know how to annotate it.

And I guess that I'd get reactions when submitting code to fix this, since
the journal could run out of space.
So I'd better know what is going on.

I seem to remember that at the pool level it is possible to reserve space
while creating a filesystem? And then it could/should be fixed when
building the disk-infra for an OSD.

--WjW
Alan Somers
2018-02-10 19:21:50 UTC
Yes, you can easily reserve space for an entire filesystem. For example:
"zfs create -o reservation=64GB mypool/myfs".
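
If the reservation is set up while provisioning an OSD, the command above could be wrapped in a small helper. This is only a sketch: the function name is made up for illustration, and it assumes the zfs(8) utility is on the PATH and that the caller is allowed to create datasets. By default it only builds the argument list, so it can be inspected before anything is run.

```python
import subprocess


def zfs_create_with_reservation(dataset, reservation, run=False):
    """Build (and optionally run) `zfs create -o reservation=... dataset`.

    Hypothetical helper: only the command line itself comes from the thread.
    """
    argv = ["zfs", "create", "-o", f"reservation={reservation}", dataset]
    if run:
        # Needs an existing pool and sufficient privileges; raises
        # subprocess.CalledProcessError if zfs exits with an error.
        subprocess.run(argv, check=True)
    return argv


# Dry run: show the command that would be executed.
print(" ".join(zfs_create_with_reservation("mypool/myfs", "64GB")))
```

Note that the reservation covers the whole filesystem, not one file, so the journal would need a dataset of its own for this to stand in for a per-file preallocation.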
Ian Lepore
2018-02-10 19:43:19 UTC
Post by Alan Somers
No.  It's fundamentally impossible to support posix_fallocate on a COW
filesystem like ZFS.  Ceph should be taught to ignore an EINVAL result,
since the system call is merely advisory.
Unfortunately, POSIX documents that the function returns EINVAL only
due to bad input parameters, so ignoring that seems like a bad idea.

Wouldn't it be better if we returned EOPNOTSUPP if that's the actual
situation?  That could be safely ignored.

-- Ian
Alan Somers
2018-02-10 19:45:24 UTC
I'm afraid you are mistaken. POSIX _should've_ required EOPNOTSUPP in this
case, but it actually requires EINVAL.

http://pubs.opengroup.org/onlinepubs/9699919799/functions/posix_fallocate.html
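
Because EINVAL can mean either bad arguments or an unsupported file system, a caller can tell the two apart by validating the arguments itself before making the call. The following is a minimal sketch of that idea, not Ceph's actual code: the function name and the injectable `fallocate` parameter are assumptions made for illustration, and it relies on Python's `os.posix_fallocate`, which raises `OSError` carrying the errno.

```python
import errno
import os


def preallocate(fd, offset, length, fallocate=os.posix_fallocate):
    """Try to preallocate space; return False if the file system cannot.

    `fallocate` is injectable only so the error handling can be exercised
    without a ZFS file system at hand (an assumption of this sketch).
    """
    # Reject bad arguments here, so that a later EINVAL cannot be blamed
    # on the caller's parameters.
    if offset < 0 or length <= 0:
        raise ValueError("offset must be >= 0 and length must be > 0")
    try:
        fallocate(fd, offset, length)
    except OSError as exc:
        if exc.errno == errno.EINVAL:
            # Arguments were checked above, so treat this as "preallocation
            # is not supported here" and let the caller carry on.
            return False
        raise  # ENOSPC, EBADF, EFBIG, ... remain real errors
    return True
```

A caller that gets False back knows the space is not guaranteed and can log that fact or rely on a filesystem-level reservation instead.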
Ian Lepore
2018-02-10 19:47:48 UTC
Oops, I apparently was looking at the prior version of the spec.
Never mind. :)

-- Ian
Willem Jan Withagen
2018-02-10 22:50:01 UTC
Post by Ian Lepore
Unfortunately, posix documents that the function returns EINVAL only
due to bad input parameters, so ignoring that seems like a bad idea.
Wouldn't it be better if we returned EOPNOTSUP if that's the actual
situation?  That could be safely ignored.
It would probably help in my situation...

And I've been looking at the manpage, but cannot seem to find any
indication that EINVAL is returned on running it on FreeBSD.

--WjW
Alan Somers
2018-02-10 23:10:44 UTC
It's in the manpage, but only on head. It hasn't been in any stable
release yet.

https://svnweb.freebsd.org/base/head/lib/libc/sys/posix_fallocate.2?revision=325422&view=markup#l112
Willem Jan Withagen
2018-02-10 23:20:20 UTC
Right, it is. And it is even in the man page where I looked. :(
I just plainly read over it.

To be honest, I would expect a bit more prose about this in the header of
the manpage, because it is rather significant that it does not work on
certain FSes, rather than hiding this in a single line in the explanation
of an error value...

--WjW
Garrett Wollman
2018-02-10 18:46:33 UTC
In article
Post by Alan Somers
Post by Willem Jan Withagen
Is there any expectation that this is going to fixed in any near future?
No. It's fundamentally impossible to support posix_fallocate on a COW
filesystem like ZFS. Ceph should be taught to ignore an EINVAL result,
since the system call is merely advisory.
I don't think it's true that this is _fundamentally_ impossible. What
the standard requires would in essence be a per-object refreservation.
ZFS supports refreservation, obviously, but not on a per-object basis.
Furthermore, there are mechanisms to preallocate blocks for things
like dumps. So it *could* be done (as in, the concept is there), but
it may not be practical. (And ultimately, there are ways in which the
administrator might manage the system that would defeat the desired
effect, but that's out of the standard's scope.) Given the semantic
mismatch, though, I suspect it's unreasonable to expect anyone to
prioritize implementation of such a feature.

-GAWollman
John Baldwin
2018-02-12 17:04:57 UTC
I don't think posix_fallocate() can be compatible with COW. Suppose you
do reserve a fixed set of blocks. That ensures the first write has a
place to write, but not if you overwrite one of those blocks. You'd have
to reserve another block to maintain the reservation each time you wrote
to a block, or you'd have to have a way to mark a file as not COW. The
first case isn't really any better than not using posix_fallocate() in the
first place as you are still requiring writes to allocate blocks, and the
second seems a bit fraught with peril as well if the application is
expecting the non-COW'd file to be in sync with other files in the system
since presumably non-COW'd files couldn't be snapshotted, etc.
--
John Baldwin
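
The objection above can be illustrated with a toy model. This is a sketch under heavy simplifications, not ZFS code: the class and function names are invented, blocks are just counters, and metadata is ignored. It shows that preallocating a file's blocks does not help an overwrite on a COW pool, because each overwrite needs a fresh block, and a snapshot keeps the old copy from being returned.

```python
import errno


class ToyCowPool:
    """Toy copy-on-write pool that only counts free blocks."""

    def __init__(self, total_blocks):
        self.free = total_blocks

    def allocate(self, count=1):
        if count > self.free:
            raise OSError(errno.ENOSPC, "no free blocks in pool")
        self.free -= count

    def release(self, count=1):
        self.free += count


def cow_overwrite(pool, old_block_pinned):
    """Overwrite one file block the COW way."""
    pool.allocate()            # the new copy is written to a fresh block
    if not old_block_pinned:   # a snapshot keeps the old copy alive
        pool.release()


pool = ToyCowPool(total_blocks=5)
pool.allocate(4)                              # "preallocate" a 4-block file
cow_overwrite(pool, old_block_pinned=False)   # works: old block is recycled
cow_overwrite(pool, old_block_pinned=True)    # works, but uses the last block
try:
    cow_overwrite(pool, old_block_pinned=True)
    failed = False
except OSError as exc:
    failed = exc.errno == errno.ENOSPC        # overwrite of "reserved" data fails
print("third overwrite failed with ENOSPC:", failed)
```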
Patrick Kelsey
2018-02-14 02:11:34 UTC
I think Garrett's assessment that it is not fundamentally impossible, but
may not be felt to be worth implementing in any given file system for
practical reasons, is correct. I say this having designed/implemented a
COW file system that was driven by customer pressure to do things that at
first pass one might declare represented an architectural contradiction,
but upon further reflection were entirely possible to do given sufficient
willingness to invest the effort and accept the accompanying trade-offs,
additional knobs to turn, etc.

In this case (posix_fallocate() + COW + snapshots), it could be implemented
with a per-object allocator that normally keeps at least one extra block
beyond the reservation requirement on hand, plus a snapshot operation that
in order to succeed has to be able to provision the local allocators of all
fallocated objects with enough additional blocks to maintain the no-fail
write guarantee post-snapshot.

-Patrick
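
The scheme described above can be sketched as a toy model. Everything here is hypothetical: the names are invented, blocks are plain counters, and nothing corresponds to an existing ZFS interface. Each fallocated file owns a private allocator holding one spare block, and a snapshot must first obtain one extra block per newly pinned block from the shared pool, or fail without changing anything.

```python
import errno


class ToyPool:
    """Shared pool of free blocks."""

    def __init__(self, free_blocks):
        self.free = free_blocks

    def take(self, count):
        if count > self.free:
            raise OSError(errno.ENOSPC, "pool cannot supply blocks")
        self.free -= count


class FallocatedFile:
    """File whose blocks are backed by a private allocator."""

    def __init__(self, pool, nblocks):
        pool.take(nblocks + 1)            # the reservation plus one spare
        self.pinned = [False] * nblocks   # is the current copy held by a snapshot?
        self.spare = 1                    # free blocks in the private allocator

    def write(self, index):
        if self.spare < 1:
            raise RuntimeError("no-fail write guarantee was broken")
        self.spare -= 1                   # the new copy comes from a spare
        if self.pinned[index]:
            self.pinned[index] = False    # old copy now belongs to the snapshot
        else:
            self.spare += 1               # old copy is recycled as a spare


def snapshot(pool, files):
    """Pin every current block, or fail before changing any state."""
    extra = sum(f.pinned.count(False) for f in files)
    pool.take(extra)                      # raises ENOSPC if it cannot provision
    for f in files:
        f.spare += f.pinned.count(False)
        f.pinned = [True] * len(f.pinned)


pool = ToyPool(free_blocks=12)
journal = FallocatedFile(pool, nblocks=4)   # takes 5 blocks from the pool
snapshot(pool, [journal])                   # takes 4 more to cover pinned blocks
for i in (0, 0, 1, 2, 3):
    journal.write(i)                        # never touches the shared pool
```

The trade-off is visible in the numbers: a snapshot costs as many extra blocks as the file has unpinned blocks, and it is the snapshot, not a later write, that fails when the pool cannot supply them.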
