Home - Waterfall Grid T-Grid Console Builders Recent Builds Buildslaves Changesources - JSON API - About

Console View


Tags: Architectures Distributions Performance Style Tests
Legend:   Passed Failed Warnings Failed Again Running Exception Offline No data

Architectures Distributions Performance Style Tests
Chengfei, Zhu
QAT: Allocate digest_buffer by QAT_PHYS_CONTIG_ALLOC() in the qat_checksum().

If the buffer 'digest_buffer' is allocated in the qat_checksum
stack, it can't ensure that the adderss is physical contiguous,
the DMA result of the buffer may have problem. The adderss
allocated by QAT_PHYS_CONTIG_ALLOC() ensure physical contiguous.

Signed-off-by: Chengfei, Zhu <chengfeix.zhu@intel.com>

Pull-request: #8521 part 1/1
Brian Behlendorf
Add missing dmu_zfetch_fini() in dnode_move_impl()

As it turns out, on the Windows platform when rw_init() is called
(rather its bedrock call ExInitializeResourceLite) it is placed on
an active-list of locks, and is removed at rw_destroy() time.

dnode_move() has logic to copy over the old-dnode to new-dnode,
including calling dmu_zfetch_init(new-dnode). But due to the missing
dmu_zfetch_fini(old-dnode), kmem will call dnode_dest() to release the
memory (and in debug builds fill pattern 0xdeadbeef) over the Windows
active-lock's prev/next list pointers, making Windows sad.

But on other platforms, the contents of dmu_zfetch_fini() is one
call to list_destroy() and one to rw_destroy(), which is effectively
a no-op call and is not required. This commit is mostly for
"correctness" and can be skipped there.

Porting Notes:
* This leak exists on Linux but currently can never happen because
  the dnode_move() functionality is not supported.

openzfsonosx-commit: openzfsonosx/zfs@d95fe517

Authored by: Julian Heuking <JulianH@beckhoff.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Jorgen Lundman <lundman@lundman.net>

Pull-request: #8519 part 1/1
Tom Caputi
Add space in error message

This patch simply adds a missing space in the
ZFS_ERR_FROM_IVSET_GUID_MISSING error message.

Reviewed-by: Richard Laager <rlaager@wiktel.com>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Don Brady <don.brady@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #8514
Brian Behlendorf
mutex leak in dsl_dataset_hold_obj()

In addition to dsl_dataset_evict_async() releasing a hold, there is
an error case in dsl_dataset_hold_obj() which had missed 4 additional
release calls.  This was introduced in a1d477c24.

openzfsonosx-commit: https://github.com/openzfsonosx/zfs/commit/63ff7f1c

Authored by: Jorgen Lundman <lundman@lundman.net>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>

Pull-request: #8517 part 1/1
Paul Dagnelie
Reserve fields and values for redacted send/recv

Signed-off-by: Paul Dagnelie <pcd@delphix.com>

Pull-request: #8516 part 1/1
Tom Caputi
Add space in error message.

This patch simply adds a missing space in the
ZFS_ERR_FROM_IVSET_GUID_MISSING error message.

Signed-off-by: Tom Caputi <tcaputi@datto.com>

Pull-request: #8514 part 1/1
Chris Wedgwood
systemd: correct path for awk (to /usr/bin/awk)

The systemd scripts use awk bit /bin/awk which doesn't exist on most
systems which use /usr/bin/awk (checked on Debian, Ubuntu, CentOS,
Alpine).

Signed-off-by: Chris Wedgwood <cw@f00f.org>

Pull-request: #8513 part 1/1
Michael Niewöhner
Move dracut specifics to dracut mount unit

Dracut depends on BOOTFS environment variable to be set after pool
import. This dracut specific systemd ExecStartPost command should not be
called for any non-dracut systems, so let's move it to a static systemd
unit that gets pulled in by the generated mount unit.

Signed-off-by: Michael Niewöhner <foss@mniewoehner.de>

Pull-request: #8510 part 3/3
Michael Niewöhner
Fix systemd-import services

On debian systemd complains about missing /bin/awk because it actually
is located at /usr/bin/awk. It is not a good idea to hardcode binary
paths because linux distros use different paths. According to systemd's
man page it is absolutely safe to miss paths for binary located at
standard locations (/bin, /sbin, /usr/bin, ...).

Further, replace this more or less complicated awk command by egrep.

Signed-off-by: Michael Niewöhner <foss@mniewoehner.de>

Pull-request: #8510 part 2/3
Michael Niewöhner
Remove hard dependency on bash

zfs-import-* services have a hard dependency on bash while not everyone
has bash installed. At this point /bin/sh is sufficient, so use that.

Signed-off-by: Michael Niewöhner <foss@mniewoehner.de>

Pull-request: #8510 part 1/3
Michael Niewöhner
Move dracut specifics to dracut mount unit

Dracut depends on BOOTFS environment variable to be set after pool
import. This dracut specific systemd ExecStartPost command should not be
called for any non-dracut systems, so let's move it to a static systemd
unit that gets pulled in by the generated mount unit.

Signed-off-by: Michael Niewöhner <foss@mniewoehner.de>

Pull-request: #8510 part 3/3
Michael Niewöhner
Fix systemd-import services

On debian systemd complains about missing /bin/awk because it actually
is located at /usr/bin/awk. It is not a good idea to hardcode binary
paths because linux distros use different paths. According to systemd's
man page it is absolutely safe to miss paths for binary located at
standard locations (/bin, /sbin, /usr/bin, ...).

Further, replace this more or less complicated awk command by egrep.

Signed-off-by: Michael Niewöhner <foss@mniewoehner.de>

Pull-request: #8510 part 2/3
Michael Niewöhner
Remove hard dependency on bash

zfs-import-* services have a hard dependency on bash while not everyone
has bash installed. At this point /bin/sh is sufficient, so use that.

Signed-off-by: Michael Niewöhner <foss@mniewoehner.de>

Pull-request: #8510 part 1/3
Michael Niewöhner
Move dracut specifics to dracut mount unit

Dracut depends on BOOTFS environment variable to be set after pool
import. This dracut specific systemd ExecStartPost command should not be
called for any non-dracut systems, so let's move it to a static systemd
unit that gets pulled in by the generated mount unit.

Pull-request: #8510 part 3/3
  • Debian 8 arm (BUILD): cloning zfs -  stdio
  • Debian 8 ppc (BUILD): cloning zfs -  stdio
  • CentOS 7 x86_64 (BUILD): cloning zfs -  stdio
  • Ubuntu 16.04 x86_64 (BUILD): cloning zfs -  stdio
  • Ubuntu 17.10 x86_64 (STYLE): cloning zfs -  stdio
Michael Niewöhner
Fix systemd-import services

On debian systemd complains about missing /bin/awk because it actually
is located at /usr/bin/awk. It is not a good idea to hardcode binary
paths because linux distros use different paths. According to systemd's
man page it is absolutely safe to miss paths for binary located at
standard locations (/bin, /sbin, /usr/bin, ...).

Further, replace this more or less complicated awk command by egrep.

Pull-request: #8510 part 2/3
  • Amazon 2 x86_64 (BUILD): cloning zfs -  stdio
  • Debian 8 ppc64 (BUILD): cloning zfs -  stdio
  • Ubuntu 16.04 aarch64 (BUILD): cloning zfs -  stdio
  • CentOS 6 x86_64 (BUILD): cloning zfs -  stdio
  • CentOS 7 x86_64 (BUILD): cloning zfs -  stdio
  • Debian 9 x86_64 (BUILD): cloning zfs -  stdio
  • Fedora 29 x86_64 (BUILD): cloning zfs -  stdio
  • Kernel.org Built-in x86_64 (BUILD): cloning zfs -  stdio
  • Ubuntu 16.04 x86_64 (BUILD): cloning zfs -  stdio
  • Ubuntu 18.04 x86_64 (BUILD): cloning zfs -  stdio
  • Ubuntu 17.10 x86_64 (STYLE): cloning zfs -  stdio
Michael Niewöhner
Remove hard dependency on bash

zfs-import-* services have a hard dependency on bash while not everyone
has bash installed. At this point /bin/sh is sufficient, so use that.

Pull-request: #8510 part 1/3
  • Debian 8 arm (BUILD): cloning zfs -  stdio
  • Debian 8 ppc (BUILD): cloning zfs -  stdio
  • Ubuntu 16.04 i386 (BUILD): cloning zfs -  stdio
  • Debian 9 x86_64 (BUILD): cloning zfs -  stdio
Richard Yao
Linux 4.11 compat: statx support

Linux 4.11 added a new statx system call that allows us to expose crtime
as btime. We do this by caching crtime in the znode to match how atime,
ctime and mtime are cached in the inode.

statx also introduced a new way of reporting whether the immutable,
append and nodump bits have been set. It adds support for reporting
compression and encryption, but the semantics on other filesystems is
not just to report compression/encryption, but to allow it to be turned
on/off at the file level. We do not support that.

We could implement semantics where we refuse to allow user modification
of the bit, but we would need to do a dnode_hold() in zfs_znode_alloc()
to find out encryption/compression information. That would introduce
locking that will have a minor (although unmeasured) performance cost.
It also would be inferior to zdb, which reports far more detailed
information. We therefore omit reporting of encryption/compression
through statx in favor of recommending that users interested in such
information use zdb.

Signed-off-by: Richard Yao <ryao@gentoo.org>
Closes #8507

Pull-request: #8509 part 1/1
Richard Yao
Linux 4.11 compat: statx support

Linux 4.11 added a new statx system call that allows us to expose crtime
as btime. We do this by caching crtime in the znode to match how atime,
ctime and mtime are cached in the inode.

statx also introduced a new way of reporting whether the immutable,
append and nodump bits have been set. It adds support for reporting
compression and encryption, but the semantics on other filesystems is
not just to report compression/encryption, but to allow it to be turned
on/off at the file level. We do not support that.

We could implement semantics where we refuse to allow user modification
of the bit, but we would need to do a dnode_hold() in zfs_znode_alloc()
to find out encryption/compression information. That would introduce
locking that will have a minor (although unmeasured) performance cost.
It also would be inferior to zdb, which reports far more detailed
information. We therefore omit reporting of encryption/compression
through statx in favor of recommending that users interested in such
information use zdb.

Signed-off-by: Richard Yao <ryao@gentoo.org>
Closes #8507
Signed-off-by: Richard Yao <ryao@gentoo.org>

Pull-request: #8509 part 1/1
Richard Yao
Drop statx support for reporting file attribtes

This is to test a conjecture that reporting file attribute data through
statx alters the behavior of lsattr on Fedora in a way that breaks
project support. This commit is intended to let the buildbot do some
checks before I have a chance to look more deeply.

Signed-off-by: Richard Yao <ryao@gentoo.org>

Pull-request: #8509 part 2/2
Richard Yao
Linux 4.11 compat: statx support

Linux 4.11 added a new statx system call that allows us to expose crtime
as btime. We do this by caching crtime in the znode to match how atime,
ctime and mtime are cached in the inode.

statx also introduced a new way of reporting whether the immutable,
append and nodump bits have been set. It adds support for reporting
compression and encryption, but the semantics on other filesystems is
not just to report compression/encryption, but to allow it to be turned
on/off at the file level. We do not support that.

We could implement semantics where we refuse to allow user modification
of the bit, but we would need to do a dnode_hold() in zfs_znode_alloc()
to find out encryption/compression information. That would introduce
locking that will have a minor (although unmeasured) performance cost.
It also would be inferior to zdb, which reports far more detailed
information. We therefore omit reporting of encryption/compression
through statx in favor of recommending that users interested in such
information use zdb.

Signed-off-by: Richard Yao <ryao@gentoo.org>
Closes #8507

Pull-request: #8509 part 1/1
Richard Yao
Linux 4.11 compat: statx support

Linux 4.11 added a new statx system call that allows us to expose crtime
as btime. We do this by caching crtime in the znode to match how atime,
ctime and mtime are cached in the inode.

statx also introduced a new way of reporting whether the immutable,
append and nodump bits have been set. It adds support for reporting
compression and encryption, but the semantics on other filesystems is
not just to report compression/encryption, but to allow it to be turned
on/off at the file level. We do not support that.

We could implement semantics where we refuse to allow user modification
of the bit, but we would need to do a dnode_hold() in zfs_znode_alloc()
to find out encryption/compression information. That would introduce
locking that will have a minor (although unmeasured) performance cost.
It also would be inferior to zdb, which reports far more detailed
information. We therefore omit reporting of encryption/compression
through statx in favor of recommending that users interested in such
information use zdb.

Signed-off-by: Richard Yao <ryao@gentoo.org>
Closes #8507

Pull-request: #8509 part 1/1
Richard Yao
Linux 4.11 compat: statx support

Linux 4.11 added a new statx system call that allows us to expose crtime
as btime. We do this by caching crtime in the znode to match how atime,
ctime and mtime are cached in the inode.

statx also introduced a new way of reporting whether the immutable,
append and nodump bits have been set. It adds support for reporting
compression and encryption, but the semantics on other filesystems is
not just to report compression/encryption, but to allow it to be turned
on/off at the file level. We do not support that.

We could implement semantics where we refuse to allow user modification
of the bit, but we would need to do a dnode_hold() in zfs_znode_alloc()
to find out encryption/compression information. That would introduce
locking that will have a minor (although unmeasured) performance cost.
It also would be inferior to zdb, which reports far more detailed
information. We therefore omit reporting of encryption/compression
through statx in favor of recommending that users interested in such
information use zdb.

Signed-off-by: Richard Yao <ryao@gentoo.org>
Closes #8507

Pull-request: #8509 part 1/1
Richard Yao
Linux 4.11 compat: statx support

Linux 4.11 added a new statx system call that allows us to expose crtime
as btime. We do this by caching crtime in the znode to match how atime,
ctime and mtime are cached in the inode.

statx also introduced a new way of reporting whether the immutable,
append and nodump bits have been set. It adds support for reporting
compression and encryption, but the semantics on other filesystems is
not just to report compression/encryption, but to allow it to be turned
on/off at the file level. We do not support that.

We could implement semantics where we refuse to allow user modification
of the bit, but we would need to do a dnode_hold() in zfs_znode_alloc()
to find out encryption/compression information. That would introduce
locking that will have a minor (although unmeasured) performance cost.
It also would be inferior to zdb, which reports far more detailed
information. We therefore omit reporting of encryption/compression
through statx in favor of recommending that users interested in such
information use zdb.

Signed-off-by: Richard Yao <ryao@gentoo.org>
Closes #8507

Pull-request: #8509 part 1/1
Richard Yao
Linux 4.11 compat: statx support

Linux 4.11 added a new statx system call that allows us to expose crtime
as btime. We do this by caching crtime in the znode to match how atime,
ctime and mtime are cached in the inode.

statx also introduced a new way of reporting whether the immutable,
append and nodump bits have been set. It adds support for reporting
compression and encryption, but the semantics on other filesystems is
not just to report compression/encryption, but to allow it to be turned
on/off at the file level. We do not support that.

We could implement semantics where we refuse to allow user modification
of the bit, but we would need to do a dnode_hold() in zfs_znode_alloc()
to find out encryption/compression information. That would introduce
locking that will have a minor (although unmeasured) performance cost.
It also would be inferior to zdb, which reports far more detailed
information. We therefore omit reporting of encryption/compression
through statx in favor of recommending that users interested in such
information use zdb.

Signed-off-by: Richard Yao <ryao@gentoo.org>
Closes #8507
Signed-off-by: Richard Yao <ryao@gentoo.org>

Pull-request: #8509 part 1/1
Richard Yao
Linux 4.11 compat: statx support

Linux 4.11 added a new statx system call that allows us to expose crtime
as btime. We do this by caching crtime in the znode to match how atime,
ctime and mtime are cached in the inode.

statx also introduced a new way of reporting whether the immutable,
append and nodump bits have been set. It adds support for reporting
compression and encryption, but the semantics on other filesystems is
not just to report compression/encryption, but to allow it to be turned
on/off at the file level. We do not support that.

We could implement semantics where we refuse to allow user modification
of the bit, but we would need to do a dnode_hold() in zfs_znode_alloc()
to find out encryption/compression information. That would introduce
locking that will have a minor (although unmeasured) performance cost.
It also would be inferior to zdb, which reports far more detailed
information. We therefore omit reporting of encryption/compression
through statx in favor of recommending that users interested in such
information use zdb.

Signed-off-by: Richard Yao <ryao@gentoo.org>
Closes #8507

Pull-request: #8509 part 1/1
Richard Yao
Linux 4.11 compat: statx support

Linux 4.11 added a new statx system call that allows us to expose crtime
as btime. We do this by caching crtime in the znode to match how atime,
ctime and mtime are cached in the inode.

statx also introduced a new way of reporting whether the immutable,
append and nodump bits have been set. It adds support for reporting
compression and encryption, but the semantics on other filesystems is
not just to report compression/encryption, but to allow it to be turned
on/off at the file level. We do not support that.

We could implement semantics where we refuse to allow user modification
of the bit, but we would need to do a dnode_hold() in zfs_znode_alloc()
to find out encryption/compression information. That would introduce
locking that will have a minor (although unmeasured) performance cost.
It also would be inferior to zdb, which reports far more detailed
information. We therefore omit reporting of encryption/compression
through statx in favor of recommending that users interested in such
information use zdb.

Signed-off-by: Richard Yao <ryao@gentoo.org>
Closes #8507

Pull-request: #8509 part 1/1
Richard Yao
Linux 4.11 compat: statx support

Linux 4.11 added a new statx system call that allows us to expose crtime
as btime. We do this by caching crtime in the znode to match how atime,
ctime and mtime are cached in the inode.

statx also introduced a new way of reporting whether the immutable,
append and nodump bits have been set. It adds support for reporting
compression and encryption, but the semantics on other filesystems is
not just to report compression/encryption, but to allow it to be turned
on/off at the file level. We do not support that.

We could implement semantics where we refuse to allow user modification
of the bit, but we would need to do a dnode_hold() in zfs_znode_alloc()
to find out encryption/compression information. That would introduce
locking that will have a minor (although unmeasured) performance cost.
It also would be inferior to zdb, which reports far more detailed
information. We therefore omit reporting of encryption/compression
through statx in favor of recommending that users interested in such
information use zdb.

Signed-off-by: Richard Yao <ryao@gentoo.org>
Closes #8507

Pull-request: #8509 part 1/1
Richard Yao
Linux 4.11 compat: statx support

Linux 4.11 added a new statx system call that allows us to expose crtime
as btime. We do this by caching crtime in the znode to match how atime,
ctime and mtime are cached in the inode.

statx also introduced a new way of reporting whether the immutable,
append and nodump bits have been set. It adds support for reporting
compression and encryption, but the semantics on other filesystems is
not just to report compression/encryption, but to allow it to be turned
on/off at the file level. We do not support that.

We could implement semantics where we refuse to allow user modification
of the bit, but we would need to do a dnode_hold() in zfs_znode_alloc()
to find out encryption/compression information. That would introduce
locking that will have a minor (although unmeasured) performance cost.
It also would be inferior to zdb, which reports far more detailed
information. We therefore omit reporting of encryption/compression
through statx in favor of recommending that users interested in such
information use zdb.

Signed-off-by: Richard Yao <ryao@gentoo.org>
Closes #8507

Pull-request: #8509 part 1/1
Brian Behlendorf
Report holes when there are only metadata changes

Update the dirty check in dmu_offset_next() such that dnode's
are only considered dirty for the purpose or reporting holes
when there are pending data blocks or frees to be synced.  This
ensures that when there are only metadata updates to be synced
(atime) that holes are reported.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #6958

Pull-request: #8505 part 1/1
Brian Behlendorf
Improve `zpool labelclear`

1) As implemented the `zpool labelclear` command overwrites
the calculated offsets of all four vdev labels even when only a
single valid label is found.  If the device as been repurposed
but still contains a valid label this can result in space no
longer owned by ZFS being zeroed.  Prevent this by verifing
every label removed is intact before it's overwritten.

2) Address a small bug in zpool_do_labelclear() which prevented
labelclear from working on file vdevs.  Only block devices support
BLKFLSBUF, try the ioctl() but when it's reported as unsupported
this should not be fatal.

3) Fix `zpool labelclear` so it can be run on vdevs which were
removed from the pool with `zpool remove`.  Additionally, allow
intact but partial labels to be cleared as in the case of a failed
`zpool attach` or `zpool replace`.

4) Remove LABELCLEAR and LABELREAD variables for test cases.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #6261

Pull-request: #8500 part 1/1
Brian Behlendorf
Improve `zpool labelclear`

1) As implemented the `zpool labelclear` command overwrites
the calculated offsets of all four vdev labels even when only a
single valid label is found.  If the device as been repurposed
but still contains a valid label this can result in space no
longer owned by ZFS being zeroed.  Prevent this by verifing
every label removed is intact before it's overwritten.

2) Address a small bug in zpool_do_labelclear() which prevented
labelclear from working on file vdevs.  Only block devices support
BLKFLSBUF, try the ioctl() but when it's reported as unsupported
this should not be fatal.

3) Fix `zpool labelclear` so it can be run on vdevs which were
removed from the pool with `zpool remove`.  Additionally, allow
intact but partial labels to be cleared as in the case of a failed
`zpool attach` or `zpool replace`.

4) Remove LABELCLEAR and LABELREAD variables for test cases.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #6261

Pull-request: #8500 part 1/1
Serapheim Dimitropoulos
Log Spacemap Project

= Motivation

At Delphix we've seen a lot of customer systems where fragmentation
is over 75% and random writes take a performance hit because a lot
of time is spend on I/Os that update on-disk space accounting metadata.
Specifically, we seen cases where 20% to 40% of sync time is spend
after sync pass 1 and ~30% of the I/Os on the system is spent updating
spacemaps.

The problem is that these pools have existed long enough that we've
touched almost every metaslab at least once, and random writes
scatter frees across all metaslabs every TXG, thus appending to
their spacemaps and resulting in many I/Os. To give an example,
assuming that every VDEV has 200 metaslabs and our writes fit within
a single spacemap block (generally 4K) we have 200 I/Os. Then if we
assume 2 levels of indirection, we need 400 additional I/Os and
since we are talking about metadata for which we keep 2 extra copies
for redundancy we need to triple that number, leading to a total of
1800 I/Os per VDEV every TXG.

We could try and decrease the number of metaslabs so we have less
I/Os per TXG but then each metaslab would cover a wider range on
disk and thus would take more time to be loaded in memory from disk.
In addition, after it's loaded, it's range tree would consume more
memory.

Another idea would be to just increase the spacemap block size
which would allow us to fit more entries within an I/O block
resulting in fewer I/Os per metaslab and a speedup in loading time.
The problem is still that we don't deal with the number of I/Os
going up as the number of metaslabs is increasing and the fact
is that we generally write a lot to a few metaslabs and a little
to the rest of them. Thus, just increasing the block size would
actually waste bandwidth because we won't be utilizing our bigger
block size.

= About this patch

This patch introduces the Log Spacemap project which provides the
solution to the above problem while taking into account all the
aforementioned tradeoffs. The details on how it achieves that can
be found in the references sections below and in the code (see
Big Theory Statement in spa_log_spacemap.c).

Even though the change is fairly constraint within the metaslab
and lower-level SPA codepaths, there is a side-change that is
user-facing. The change is that VDEV IDs from VDEV holes will no
longer be reused. To give some background and reasoning for this,
when a log device is removed and its VDEV structure was replaced
with a hole (or was compacted; if at the end of the vdev array),
its vdev_id could be reused by devices added after that. Now
with the pool-wide space maps recording the vdev ID, this behavior
can cause problems (e.g. is this entry refering to a segment in
the new vdev or the removed log?). Thus, to simplify things the
ID reuse behavior is gone and now vdev IDs for top-level vdevs
are truly unique within a pool.

= Testing

The illumos implementation of this feature has been used internally
for a year and has been in production for ~6 months. For this patch
specifically there don't seem to be any regressions introduced to
ZTS and I have been running zloop for a week without any related
problems.

= Performance Analysis (Linux Specific)

All performance results and analysis for illumos can be found in
the links of the references. Redoing the same experiments in Linux
gave similar results. Below are the specifics of the Linux run.

After the pool reached stable state the percentage of the time
spent in pass 1 per TXG was 64% on average for the stock bits
while the log spacemap bits stayed at 95% during the experiment
(graph: sdimitro.github.io/img/linux-lsm/PercOfSyncInPassOne.png).

Sync times per TXG were 37.6 seconds on average for the stock
bits and 22.7 seconds for the log spacemap bits (related graph:
sdimitro.github.io/img/linux-lsm/SyncTimePerTXG.png). As a result
the log spacemap bits were able to push more TXGs, which is also
the reason why all graphs quantified per TXG have more entries for
the log spacemap bits.

Another interesting aspect in terms of txg syncs is that the stock
bits had 22% of their TXGs reach sync pass 7, 55% reach sync pass 8,
and 20% reach 9. The log space map bits reached sync pass 4 in 79%
of their TXGs, sync pass 7 in 19%, and sync pass 8 at 1%. This
emphasizes the fact that not only we spend less time on metadata
but we also iterate less times to convergence in spa_sync() dirtying
objects.
[related graphs:
stock- sdimitro.github.io/img/linux-lsm/NumberOfPassesPerTXGStock.png
lsm- sdimitro.github.io/img/linux-lsm/NumberOfPassesPerTXGLSM.png]

Finally, the improvement in IOPs that the userland gains from the
change is approximately 40%. There is a consistent win in IOPS as
you can see from the graphs below but the absolute amount of
improvement that the log spacemap gives varies within each minute
interval.
sdimitro.github.io/img/linux-lsm/StockVsLog3Days.png
sdimitro.github.io/img/linux-lsm/StockVsLog10Hours.png

= Porting to Other Platforms

For people that want to port this commit to other platforms below
is a list of ZoL commits that this patch depends on:

Make zdb results for checkpoint tests consistent
db587941c5ff6dea01932bb78f70db63cf7f38ba

Update vdev_is_spacemap_addressable() for new spacemap encoding
419ba5914552c6185afbe1dd17b3ed4b0d526547

Simplify spa_sync by breaking it up to smaller functions
8dc2197b7b1e4d7ebc1420ea30e51c6541f1d834

Factor metaslab_load_wait() in metaslab_load()
b194fab0fb6caad18711abccaff3c69ad8b3f6d3

Rename range_tree_verify to range_tree_verify_not_present
df72b8bebe0ebac0b20e0750984bad182cb6564a

Change target size of metaslabs from 256GB to 16GB
c853f382db731e15a87512f4ef1101d14d778a55

zdb -L should skip leak detection altogether
21e7cf5da89f55ce98ec1115726b150e19eefe89

vs_alloc can underflow in L2ARC vdevs
7558997d2f808368867ca7e5234e5793446e8f3f

Simplify log vdev removal code
6c926f426a26ffb6d7d8e563e33fc176164175cb

Get rid of space_map_update() for ms_synced_length
425d3237ee88abc53d8522a7139c926d278b4b7f

Introduce auxiliary metaslab histograms
928e8ad47d3478a3d5d01f0dd6ae74a9371af65e

Error path in metaslab_load_impl() forgets to drop ms_sync_lock
8eef997679ba54547f7d361553d21b3291f41ae7

= References

Background, Motivation, and Internals of the Feature
- OpenZFS 2017 Presentation:
youtu.be/jj2IxRkl5bQ
- Slides:
slideshare.net/SerapheimNikolaosDim/zfs-log-spacemaps-project

Flushing Algorithm Internals & Performance Results
(Illumos Specific)
- Blogpost:
sdimitro.github.io/post/zfs-lsm-flushing/
- OpenZFS 2018 Presentation:
youtu.be/x6D2dHRjkxw
- Slides:
slideshare.net/SerapheimNikolaosDim/zfs-log-spacemap-flushing-algorithm

Upstream Delphix Issues:
DLPX-51539, DLPX-59659, DLPX-57783, DLPX-61438, DLPX-41227,
DLPX-59320

Signed-off-by: Serapheim Dimitropoulos <serapheim@delphix.com>

Pull-request: #8442 part 1/1
Brian Behlendorf
Review Feedback

* Updated comments.
* Converted ZIO_PRIORITY_TRIM to ASSERTs in vdev_queue_io()

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>

Pull-request: #8419 part 2/2
Brian Behlendorf
Add UNMAP/TRIM functionality to ZFS

UNMAP/TRIM support is a frequently-requested feature to help
prevent performance from degrading on SSDs and on various other
SAN-like storage back-ends.  By issuing UNMAP/TRIM commands for
sectors which are no longer allocated the underlying device can
often more effeciently manage itself.

This TRIM implementation is modeled on the `zpool initialize`
feature which writes a pattern to all unallocated space in the
pool.  The new `zpool trim` command uses the same vdev_xlate()
code to calculate what sectors are unallocated, the same per-
vdev TRIM thread model and locking, and the same basic CLI for
a consistent user experience.  The core difference is that
instead of writing a pattern it will issue UNMAP/TRIM commands
for those extents.

The zio pipeline was updated to accomidate this by adding a new
ZIO_TYPE_TRIM type and assoicated spa taskq.  This new type makes
is straight forward to add the platform specific TRIM/UNMAP calls
to vdev_disk.c and vdev_file.c.  These new ZIO_TYPE_TRIM zios are
handled largely the same way as ZIO_TYPE_READs or ZIO_TYPE_WRITEs.
This made it possible to largely avoid changing the pipieline,
one exception is that TRIM zio's may exceed the 16M block size
limit since they contain no data.

In addition to the manual `zpool trim` command a background
automatic TRIM was added and is controlled by the 'autotrim'
property.  It relies on the exact same infrastructure as the
manual TRIM.  However, instead of relying on the extents in a
metaslab's ms_allocatable range tree, a ms_trim tree is kept
per metaslab.  When 'autotrim=on' ranges added back to the
ms_allocatable tree are also added to the ms_free tree.  The
ms_free tree is then periodically consumed by an autotrim
thread which systematically walks a top level vdev's metaslabs.

Since the automatic TRIM will skip ranges it considers too small
there is value in occasionally running a full `zpool trim`.  This
may occur when the freed blocks are small and not enough time
was allowed to aggregate them.  An automatic TRIM and a manual
`zpool trim` may be run concurrently, in which case the automatic
TRIM will yield to the manual TRIM.

Contributions-by: Saso Kiselkov <saso.kiselkov@nexenta.com>
Contributions-by: Tim Chase <tim@chase2k.com>
Contributions-by: Chunwei Chen <tuxoko@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>

Pull-request: #8419 part 1/2
Olaf Faaland
MMP interval and fail_intervals in uberblock

When Multihost is enabled, and a pool is imported, uberblock writes
include ub_mmp_delay to allow an importing node to calculate the
duration of an activity test.  This value, is not enough information.

If zfs_multihost_fail_intervals > 0 on the node with the pool imported,
the safe minimum duration of the activity test is well defined, but does
not depend on ub_mmp_delay:

zfs_multihost_fail_intervals * zfs_multihost_interval

and if zfs_multihost_fail_intervals == 0 on that node, there is no such
well defined safe duration, but the importing host cannot tell whether
mmp_delay is high due to I/O delays, or due to a very large
zfs_multihost_interval setting on the host which last imported the pool.
As a result, it may use a far longer period for the activity test than
is necessary.

This patch renames ub_mmp_sequence to ub_mmp_config and uses it to
record the zfs_multihost_interval and zfs_multihost_fail_intervals
values, as well as the mmp sequence.  This allows a shorter activity
test duration to be calculated by the importing host in most situations.
These values are also added to the multihost_history kstat records.

It calculates the activity test duration differently depending on
whether the new fields are present or not; for importing pools with
only ub_mmp_delay, it uses

(zfs_multihost_interval + ub_mmp_delay) * zfs_multihost_import_intervals

Which results in an activity test duration less sensitive to the leaf
count.

In addition, it makes a few other improvements:
* It updates the "sequence" part of ub_mmp_config when MMP writes
  in between syncs occur.  This allows an importing host to detect MMP
  on the remote host sooner, when the pool is idle, as it is not limited
  to the granularity of ub_timestamp (1 second).
* It issues writes immediately when zfs_multihost_interval is changed
  so remote hosts see the updated value as soon as possible.
* It fixes a bug where setting zfs_multihost_fail_intervals = 1 results
  in immediate pool suspension.
* Update tests to verify activity check duration is based on recorded
  tunable values, not tunable values on importing host.
* Update tests to verify the expected number of uberblocks have valid
  MMP fields - fail_intervals, mmp_interval, mmp_seq (sequence number),
  that sequence number is incrementing, and that uberblock values match
  tunable settings.

Signed-off-by: Olaf Faaland <faaland1@llnl.gov>

Pull-request: #7842 part 1/1
Olaf Faaland
MMP interval and fail_intervals in uberblock

When Multihost is enabled, and a pool is imported, uberblock writes
include ub_mmp_delay to allow an importing node to calculate the
duration of an activity test.  This value, is not enough information.

If zfs_multihost_fail_intervals > 0 on the node with the pool imported,
the safe minimum duration of the activity test is well defined, but does
not depend on ub_mmp_delay:

zfs_multihost_fail_intervals * zfs_multihost_interval

and if zfs_multihost_fail_intervals == 0 on that node, there is no such
well defined safe duration, but the importing host cannot tell whether
mmp_delay is high due to I/O delays, or due to a very large
zfs_multihost_interval setting on the host which last imported the pool.
As a result, it may use a far longer period for the activity test than
is necessary.

This patch renames ub_mmp_sequence to ub_mmp_config and uses it to
record the zfs_multihost_interval and zfs_multihost_fail_intervals
values, as well as the mmp sequence.  This allows a shorter activity
test duration to be calculated by the importing host in most situations.
These values are also added to the multihost_history kstat records.

It calculates the activity test duration differently depending on
whether the new fields are present or not; for importing pools with
only ub_mmp_delay, it uses

(zfs_multihost_interval + ub_mmp_delay) * zfs_multihost_import_intervals

Which results in an activity test duration less sensitive to the leaf
count.

In addition, it makes a few other improvements:
* It updates the "sequence" part of ub_mmp_config when MMP writes
  in between syncs occur.  This allows an importing host to detect MMP
  on the remote host sooner, when the pool is idle, as it is not limited
  to the granularity of ub_timestamp (1 second).
* It issues writes immediately when zfs_multihost_interval is changed
  so remote hosts see the updated value as soon as possible.
* It fixes a bug where setting zfs_multihost_fail_intervals = 1 results
  in immediate pool suspension.
* Update tests to verify activity check duration is based on recorded
  tunable values, not tunable values on importing host.
* Update tests to verify the expected number of uberblocks have valid
  MMP fields - fail_intervals, mmp_interval, mmp_seq (sequence number),
  that sequence number is incrementing, and that uberblock values match
  tunable settings.

Signed-off-by: Olaf Faaland <faaland1@llnl.gov>

Pull-request: #7842 part 1/1
Olaf Faaland
MMP interval and fail_intervals in uberblock

When Multihost is enabled, and a pool is imported, uberblock writes
include ub_mmp_delay to allow an importing node to calculate the
duration of an activity test.  This value, is not enough information.

If zfs_multihost_fail_intervals > 0 on the node with the pool imported,
the safe minimum duration of the activity test is well defined, but does
not depend on ub_mmp_delay:

zfs_multihost_fail_intervals * zfs_multihost_interval

and if zfs_multihost_fail_intervals == 0 on that node, there is no such
well defined safe duration, but the importing host cannot tell whether
mmp_delay is high due to I/O delays, or due to a very large
zfs_multihost_interval setting on the host which last imported the pool.
As a result, it may use a far longer period for the activity test than
is necessary.

This patch renames ub_mmp_sequence to ub_mmp_config and uses it to
record the zfs_multihost_interval and zfs_multihost_fail_intervals
values, as well as the mmp sequence.  This allows a shorter activity
test duration to be calculated by the importing host in most situations.
These values are also added to the multihost_history kstat records.

It calculates the activity test duration differently depending on
whether the new fields are present or not; for importing pools with
only ub_mmp_delay, it uses

(zfs_multihost_interval + ub_mmp_delay) * zfs_multihost_import_intervals

Which results in an activity test duration less sensitive to the leaf
count.

In addition, it makes a few other improvements:
* It updates the "sequence" part of ub_mmp_config when MMP writes
  in between syncs occur.  This allows an importing host to detect MMP
  on the remote host sooner, when the pool is idle, as it is not limited
  to the granularity of ub_timestamp (1 second).
* It issues writes immediately when zfs_multihost_interval is changed
  so remote hosts see the updated value as soon as possible.
* It fixes a bug where setting zfs_multihost_fail_intervals = 1 results
  in immediate pool suspension.
* Update tests to verify activity check duration is based on recorded
  tunable values, not tunable values on importing host.
* Update tests to verify the expected number of uberblocks have valid
  MMP fields - fail_intervals, mmp_interval, mmp_seq (sequence number),
  that sequence number is incrementing, and that uberblock values match
  tunable settings.

Signed-off-by: Olaf Faaland <faaland1@llnl.gov>

Pull-request: #7842 part 1/1
Olaf Faaland
MMP interval and fail_intervals in uberblock

When Multihost is enabled, and a pool is imported, uberblock writes
include ub_mmp_delay to allow an importing node to calculate the
duration of an activity test.  This value, is not enough information.

If zfs_multihost_fail_intervals > 0 on the node with the pool imported,
the safe minimum duration of the activity test is well defined, but does
not depend on ub_mmp_delay:

zfs_multihost_fail_intervals * zfs_multihost_interval

and if zfs_multihost_fail_intervals == 0 on that node, there is no such
well defined safe duration, but the importing host cannot tell whether
mmp_delay is high due to I/O delays, or due to a very large
zfs_multihost_interval setting on the host which last imported the pool.
As a result, it may use a far longer period for the activity test than
is necessary.

This patch renames ub_mmp_sequence to ub_mmp_config and uses it to
record the zfs_multihost_interval and zfs_multihost_fail_intervals
values, as well as the mmp sequence.  This allows a shorter activity
test duration to be calculated by the importing host in most situations.
These values are also added to the multihost_history kstat records.

It calculates the activity test duration differently depending on
whether the new fields are present or not; for importing pools with
only ub_mmp_delay, it uses

(zfs_multihost_interval + ub_mmp_delay) * zfs_multihost_import_intervals

Which results in an activity test duration less sensitive to the leaf
count.

In addition, it makes a few other improvements:
* It updates the "sequence" part of ub_mmp_config when MMP writes
  in between syncs occur.  This allows an importing host to detect MMP
  on the remote host sooner, when the pool is idle, as it is not limited
  to the granularity of ub_timestamp (1 second).
* It issues writes immediately when zfs_multihost_interval is changed
  so remote hosts see the updated value as soon as possible.
* It fixes a bug where setting zfs_multihost_fail_intervals = 1 results
  in immediate pool suspension.
* Update tests to verify activity check duration is based on recorded
  tunable values, not tunable values on importing host.
* Update tests to verify the expected number of uberblocks have valid
  MMP fields - fail_intervals, mmp_interval, mmp_seq (sequence number),
  that sequence number is incrementing, and that uberblock values match
  tunable settings.

Signed-off-by: Olaf Faaland <faaland1@llnl.gov>

Pull-request: #7842 part 1/1