Console View
Crag Wang
pam_zfs_key: support keyfile for filesystem mount in sm_open_session

Use the password obtained via pam (pw_get) as the key to decrypt the
filesystem if keylocation is prompt; otherwise use zfs_crypto_load_key
to verify and parse the keylocation in order to get the key loaded.

When the authentication token is changed for a user, the dataset
associated with that user should not always follow the change
accordingly; only do so when 'prompt' is in use for the keylocation.

Signed-off-by: Crag Wang <crag0715@gmail.com>

Pull-request: #11247 part 1/1
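The dispatch described above can be sketched in shell (illustrative only; the real logic lives in C inside pam_zfs_key, and the keylocation value here is a made-up example):

```shell
#!/bin/sh
# Sketch of the key-loading decision: if keylocation is 'prompt', the
# password obtained via pam is used as the key; any other keylocation
# (e.g. a keyfile URI) is handled by zfs_crypto_load_key.
keylocation=file:///etc/zfs/keys/home.key
case "$keylocation" in
    prompt) path=pam-password ;;
    *)      path=zfs_crypto_load_key ;;
esac
echo "$path"
```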
Đoàn Trần Công Danh
dracut: use /bin/sh instead of bash as the interpreter

Although dracut itself has a hard dependency on bash, its modules
don't; dracut only requires bash for module-setup (on a fully usable
machine).  Inside the initramfs, dracut allows users to choose from a
handful of other shells, e.g. bash, busybox, dash, mksh.

In fact, my local machine's initramfs is built with dash, and it has
been functional for a very long time.

Before 64025fa3a (Silence 'make checkbashisms', 2020-08-20), we also
allowed our users that choice.

Let's fix the problems 'make checkbashisms' reported and allow our
users to have that choice again.

For the 'plymouth' case, let's simply run the command inside the if
instead of checking for the existence of the command before running
it, because the exit status is also a failure if plymouth is
unavailable.

While we're at it, let's replace an unnecessary fork for grep in
zfs-generator.sh.in, and the complicated 'if elif fi' that follows
it, with a simple 'case ... esac'.

To support this change, also exclude 90zfs from "make checkbashisms",
because the current CI infrastructure ships an old version of
"checkbashisms" which complains about "command -v", while the latest
"checkbashisms" accepts it.  We can revert that change to
"Makefile.am" once the CI infrastructure is updated.

Signed-off-by: Đoàn Trần Công Danh <congdanhqx@gmail.com>

Pull-request: #11244 part 1/1
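The two POSIX-sh patterns mentioned above can be sketched like this (illustrative, not the actual dracut module code; the cmdline string is a hard-coded sample):

```shell
#!/bin/sh
# 1. Run the command directly inside the 'if': if the command does not
#    exist, the 'if' sees a failing exit status anyway, so no separate
#    existence check is needed.
if plymouth --ping 2>/dev/null; then
    splash=yes
else
    splash=no
fi

# 2. Replace a grep fork plus 'if elif fi' with a 'case ... esac' over
#    the kernel command line.
cmdline="root=zfs:rpool/ROOT ro quiet"
case " $cmdline " in
    *" root=zfs:"*)  root_type=zfs ;;
    *" root=UUID="*) root_type=uuid ;;
    *)               root_type=unknown ;;
esac
echo "$root_type"
```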
Brian Behlendorf
Verify zfs module loaded before starting services

Extend the change made in ae12b02, which verifies the zfs kernel
modules are loaded, to the rest of the OpenZFS services.  If the
modules aren't loaded, then none of the share, volume, or zed
services can be started.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>

Pull-request: #11243 part 1/1
Brian Behlendorf
Tag 2.0.0-rc7

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Alexander Motin
Reduce latency effects of non-interactive I/O

Investigating the influence of scrub (especially sequential scrub) on
random read latency, I've noticed that on some HDDs a single 4KB read
may take up to 4 seconds!  Deeper investigation showed that many HDDs
heavily prioritize sequential reads even when those are submitted
with a queue depth of 1.

This patch addresses the latency from two sides:
- by using _min_active queue depths for non-interactive requests
  while the interactive request(s) are active and for a few requests
  after;
- by throttling it further if no interactive requests have completed
  while the configured number of non-interactive ones have.

While there, I've also modified vdev_queue_class_to_issue() to give
more chances to schedule at least _min_active requests to the lowest
priorities.  It should reduce starvation when several non-interactive
processes run at the same time as some interactive ones, and I think
it should make it possible to set zfs_vdev_max_active as low as 1.

I've benchmarked this change with 4KB random reads from a ZVOL with a
16KB block size on a newly written, non-fragmented pool.  On a
fragmented pool I also saw improvements, but not as dramatic.  Below
are log2 histograms of the random read latency in milliseconds for
different devices:

4 2x mirror vdevs of SATA HDD WDC WD20EFRX-68EUZN0 before:
0, 0, 2,  1,  12,  21,  19,  18, 10, 15, 17, 21
after:
0, 0, 0, 24, 101, 195, 419, 250, 47,  4,  0,  0
i.e. the maximum latency dropped from 2s to 500ms.

4 2x mirror vdevs of SATA HDD WDC WD80EFZX-68UW8N0 before:
0, 0,  2,  31,  38,  28,  18,  12, 17, 20, 24, 10, 3
after:
0, 0, 55, 247, 455, 470, 412, 181, 36,  0,  0,  0, 0
i.e. from 4s to 250ms.

1 SAS HDD SEAGATE ST14000NM0048 before:
0,  0,  29,  70, 107,  45,  27, 1, 0, 0, 1, 4, 19
after:
1, 29, 681, 1261, 676, 1633,  67, 1, 0, 0, 0, 0,  0
i.e. from 4s to 125ms.

1 SAS SSD SEAGATE XS3840TE70014 before (microseconds):
0, 0, 0, 0, 0, 0, 0, 0,  70, 18343, 82548, 618
after:
0, 0, 0, 0, 0, 0, 0, 0, 283, 92351, 34844,  90

I've also measured scrub time during the test and on idle pools.  On
an idle fragmented pool I measured the scrub getting a few percent
faster due to the use of QD3 instead of QD2.  On an idle
non-fragmented pool I measured no difference.  On a busy
non-fragmented pool I measured a scrub time increase of about
1.5-1.7x, while the IOPS increase reached 5-9x.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored-By: iXsystems, Inc.
Closes #11166
cragw
pam_zfs_key: accommodate different dataset naming scheme

The name of the dataset for a user's home directory may vary from the
expected $homes_prefix/$username if a different naming scheme is in
use.

We can use the mountpoint property to identify the dataset for
$username, as long as its value is identical to passwd's pw_dir.

For example:
    NAME                      PROPERTY    VALUE
    rpool/home/myuser_123456  mountpoint  /home/myuser

Reviewed-by: Felix Dörre <felix@dogcraft.de>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Crag Wang <crag0715@gmail.com>
Closes #11165
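The lookup described above can be sketched in shell (illustrative only; the real code is C inside pam_zfs_key, and the printf lines stand in for the output of 'zfs list -H -o name,mountpoint'):

```shell
#!/bin/sh
# Given the user's home directory (pw_dir), find the dataset whose
# mountpoint property matches it, regardless of the dataset's name.
pw_dir=/home/myuser
dataset=$({
    printf '%s %s\n' rpool/home/myuser_123456 /home/myuser
    printf '%s %s\n' rpool/home/other /home/other
} | awk -v dir="$pw_dir" '$2 == dir { print $1 }')
echo "$dataset"
```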
Matthew Macy
FreeBSD: decouple ZFS_DEBUG from kernel debug settings

Reviewed-by: Martelli Nikola @martellini
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Signed-off-by: Matt Macy <mmacy@FreeBSD.org>
Closes #11213
Brian Behlendorf
Obsolete earlier packages due to version bump

In order for package managers such as dnf to upgrade cleanly after
the package SONAME bump, the obsoleted package names must be known.
Update the new packages to correctly obsolete the old ones.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #11230
Closes #11233
Antonio Russo
libzfsbootenv: do not depend on libnvpair

We do not build libnvpair.pc.  Moreover, it is automatically pulled in
by libzfs.pc, so no additional specific dependency is required.

Reviewed by: Toomas Soome <tsoome@me.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Antonio Russo <aerusso@aerusso.net>
Closes #11227
Ryan Moeller
Avoid extra work updating ARC kstats and tunables

After e357046 it should not be necessary to periodically update ARC
kstats and tunables.

Signed-off-by: Ryan Moeller <ryan@iXsystems.com>

Pull-request: #11237 part 1/1
Brian Behlendorf
Remove incorrect assertion

Commit 85703f6 added a new ASSERT to zfs_write() as part of the
cleanup which isn't correct in the case where multiple processes
are concurrently extending a file.  The `zp->z_size` is updated
atomically while holding a range lock on only a portion of the
file.  Therefore, it's possible for the file size to increase
after the same check was performed earlier in the loop, causing this
ASSERT to fail.  The code itself handles this case correctly, so
only the invalid ASSERT needs to be removed.

Reviewed-by: Brian Atkinson <batkinson@lanl.gov>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #11235
Brian Behlendorf
Update dRAID short feature description

The documentation describes dRAID as a distributed spare, not
parity, RAID implementation.  Update the short feature description
to match the rest of the documentation.

Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #11229
Matt Macy
Decouple arc_read_done callback from arc buf instantiation

- Add ARC_FLAG_CALLBACK_ONLY to indicate that a buffer need
  not be instantiated.

This should fix #11220:
~20% performance regression on cached reads due to zfetch changes

Signed-off-by: Matt Macy <mmacy@FreeBSD.org>

Pull-request: #11232 part 1/1
Richard Laager
Raise raidz mindev requirement

Previously, the requirement was nparity + 1.  It is now nparity + 2.

There is no point in creating a two-disk raidz1; one should create a
two-disk mirror instead.  Likewise, a three-disk raidz2 should be a
three-way mirror, and a four-disk raidz3 should be a four-way mirror
(or do something else!).

This change helps prevent people from shooting themselves in the foot.

Signed-off-by: Richard Laager <rlaager@wiktel.com>

Pull-request: #11231 part 1/1
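The new rule can be sketched as a tiny shell helper (illustrative only; the actual check lives in the zpool/libzfs code, and raidz_mindev is a hypothetical name):

```shell
#!/bin/sh
# A raidz vdev of parity P now needs at least P + 2 disks
# (previously P + 1), so raidz1 needs 3, raidz2 needs 4, raidz3
# needs 5; anything smaller should be a mirror instead.
raidz_mindev() {
    nparity=$1
    echo $((nparity + 2))
}
raidz_mindev 1
raidz_mindev 3
```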
Pavel Snajdr
Resolve path issue

- Move the zpool_influxdb command to /usr/libexec/zfs.
  Packages are generally expected to use a subdirectory.

- Include the /usr/libexec/zfs path in the system search
  directory when running the test suite.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Pavel Snajdr <snajpa@snajpa.net>

Pull-request: #11224 part 2/2
Pavel Snajdr
zpool_influxdb: move to libexec dir

Move zpool_influxdb to libexec dir, as per following discussions:

https://github.com/openzfs/zfs/issues/11156
https://github.com/openzfs/zfs/pull/11160

Signed-off-by: Pavel Snajdr <snajpa@snajpa.net>

Pull-request: #11224 part 1/2
George Amanakis
Fix raw sends on encrypted datasets when copying back snapshots

When sending raw encrypted datasets the user space accounting is
present when it's not expected to be.  This leads to a subsequent
mount failure due to a checksum error when verifying the local mac.
Fix this by clearing the OBJSET_FLAG_USERACCOUNTING_COMPLETE flag and
resetting the local mac.  This allows the user accounting to be
correctly updated on first mount using the normal upgrade process.

Closes #10523.

Signed-off-by: George Amanakis <gamanakis@gmail.com>

Pull-request: #11221 part 1/1
Andriy Gapon
remove overzealous assert

*offset can become less than start not only because of a wrap-around,
but also because of shifting right, doing no iterations and then
shifting back.

Signed-off-by: Andriy Gapon <avg@FreeBSD.org>

Pull-request: #11200 part 3/3
Andriy Gapon
dnode_next_offset: do not ascend needlessly

Do that by limiting the maximum level to a level that covers all
possible offsets.  Going above that level is bound to be fruitless.
That's not a problem by itself, but it breaks an assumption in the
proposed fix to dnode_next_offset that an unsuccessful descent can
happen only when txg parameter is not zero.

This change reworks a change to dnode_next_offset_level made in
commit 031d7c2fe6a.  After this change the span can never be too
large for a shift.

The described situation can happen with a (meta-)dnode like this:

    Object  lvl  iblk  dblk  dsize  dnsize  lsize  %full  type
        0    6  128K    16K  575K    512  608K    3.78  DMU dnode

That is, dn_datablkshift = 14, dn_indblkshift = 17, dn_nlevels = 6
and 64-bit offsets.  Level 5 already covers all possible disk offsets,
but dnode_next_offset() would still call dnode_next_offset_level(lvl=6)
to examine the dnode's dn_blkptr.  That would lead to the described
issue.

Signed-off-by: Andriy Gapon <avg@FreeBSD.org>

Pull-request: #11200 part 2/3
Andriy Gapon
fix a subtle bug in dnode_next_offset() with txg > 0

Only stop the search when the upward search cannot find a matching
block pointer in the requested direction from the current offset.
That definitely means that the search is exhausted.

But if the downward search cannot find a suitable block at the
requested level, then cycle back to the upward search for farther
offsets.

Signed-off-by: Andriy Gapon <avg@FreeBSD.org>

Pull-request: #11200 part 1/3
Ryan Moeller
ZTS: Fix incorrect use of libtest in user_run by xattr_003_neg

You can't use user_run to eval ksh functions defined in libtest unless
you include libtest in the user shell.

Simplify user_run to retain the current environment, eliminate eval,
and feed the command string into ksh.  Enhance the logging for
user_run so we can see stdout and stderr.

Fix xattr_003_neg by:
* running ksh as the user
* feeding it the commands to include libtest *then* run get_xattr
* asserting this fails
* using variables for filenames so they don't change in the user's shell
* not logging the contents of /etc/passwd
* cleaning up all byproducts

Signed-off-by: Ryan Moeller <ryan@iXsystems.com>

Pull-request: #11185 part 1/1
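The simplified user_run idea can be sketched as follows (illustrative only; the real helper uses ksh and runs the command as the target user, which is omitted here, and run_cmd is a hypothetical name):

```shell
#!/bin/sh
# Feed the command string to the shell on stdin, so no 'eval' is
# needed, and capture stdout and stderr separately for logging.
errfile=$(mktemp)
run_cmd() {
    out=$(printf '%s\n' "$1" | sh 2>"$errfile")
    err=$(cat "$errfile")
}
run_cmd 'echo hello; echo oops >&2'
echo "out=$out err=$err"
rm -f "$errfile"
```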
Brian Atkinson
WIP Direct IO

Adding O_DIRECT support to ZFS to bypass the ARC for writes/reads.

O_DIRECT support in ZFS will always ensure there is coherency between
buffered and O_DIRECT IO requests.  This ensures that all IO
requests, whether buffered or direct, will see the same file contents
at all times.  Just as in other filesystems, O_DIRECT does not imply
O_SYNC.  While data is written directly to the VDEV disks, metadata
will not be synced until the associated TXG is synced.

For both O_DIRECT read and write requests, the offset and request
size must, at a minimum, be PAGE_SIZE aligned.  If they are not,
EINVAL is returned.

For O_DIRECT writes:
The request must also be block aligned (recordsize) or the write
request will take the normal (buffered) write path.  If the request
is block aligned and a cached copy of the buffer exists in the ARC,
that copy will be discarded from the ARC, forcing all further reads
to retrieve the data from disk.

For O_DIRECT reads:
The only alignment restriction is PAGE_SIZE alignment.  If the
requested data is already buffered (in the ARC), it will simply be
copied from the ARC into the user buffer.

To ensure data integrity for all data written using O_DIRECT, all
user pages are made stable whenever one of the following is required:
checksum, compression, encryption, or parity.  By making the user
pages stable, we make sure the contents of the user-provided buffer
cannot be changed after any of the above operations have taken place.

A new dataset property `direct` has been added with the following
three allowable values:
disabled - Accepts the O_DIRECT flag, but silently ignores it and
           treats the request as a buffered IO request.
default  - Follows the alignment restrictions outlined above for
           read/write IO requests when the O_DIRECT flag is used.
always   - Treats every read/write IO request as though it passed
           O_DIRECT, and follows the alignment restrictions outlined
           above.

Signed-off-by: Brian Atkinson <batkinson@lanl.gov>
Co-authored-by: Mark Maybee <mmaybee@cray.com>
Co-authored-by: Brian Atkinson <batkinson@lanl.gov>

Pull-request: #10018 part 2/2
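The PAGE_SIZE alignment rule above can be sketched in shell (illustrative only; the real check is C inside ZFS, direct_ok is a hypothetical name, and 4096 is an assumed page size):

```shell
#!/bin/sh
# Offset and request size must both be PAGE_SIZE aligned, otherwise
# the O_DIRECT request is rejected with EINVAL.
PAGE_SIZE=4096
direct_ok() {
    offset=$1 size=$2
    if [ $((offset % PAGE_SIZE)) -eq 0 ] && \
       [ $((size % PAGE_SIZE)) -eq 0 ]; then
        echo ok
    else
        echo EINVAL
    fi
}
direct_ok 0 8192     # aligned offset and size
direct_ok 512 4096   # misaligned offset
```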
Brian Atkinson
Reorganizing DMU Code

Reorganizing code in dmu.c so that the Direct IO patches will only
show the additions to dmu.c instead of also showing functions that
have been moved.

The code had to be reorganized a bit in order for the Direct IO code
paths in dmu.c to call some of the original DMU API.

Signed-off-by: Brian Atkinson <batkinson@lanl.gov>
Authored-by: Brian Atkinson <batkinson@lanl.gov>

Pull-request: #10018 part 1/2