Console View


Rich Ercolani
Add another bdev_io_acct case for Linux 6.3+ compat

Linux 6.3+, and backports from it, changed the signatures
on bdev_io_{start,end}_acct.

Add a case for it.
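
For reference, a minimal sketch of the kind of compat shim this adds,
assuming the kernel change is that the sectors count moved from
bdev_start_io_acct() to bdev_end_io_acct(); the wrapper and configure
macro names below are illustrative, not necessarily the ones used in
the tree:

    /* Illustrative only; HAVE_BDEV_IO_ACCT_63 is an assumed macro name. */
    static inline unsigned long
    vdev_blk_start_io_acct(struct block_device *bdev, struct bio *bio)
    {
    #if defined(HAVE_BDEV_IO_ACCT_63)
        /* Linux 6.3+: sectors are no longer passed at start. */
        return (bdev_start_io_acct(bdev, bio_op(bio), jiffies));
    #else
        /* Older kernels: sectors are passed at start. */
        return (bdev_start_io_acct(bdev, bio_sectors(bio), bio_op(bio),
            jiffies));
    #endif
    }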

Fixes: #14658

Signed-off-by: Rich Ercolani <rincebrain@gmail.com>

Pull-request: #14668 part 1/1
Ameer Hamza
zed: add hotplug support for spare vdevs

This commit adds support for spare vdev hotplug. The
spare vdev associated with all the pools will be
marked as "Removed" when the drive is physically
detached and will become "Available" when the
drive is reattached. Previously, the spare vdev
status did not change on drive removal, and the
same was the case with reattachment.

Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ameer Hamza <ahamza@ixsystems.com>
Closes #14295

Pull-request: #14667 part 3/3
Ameer Hamza
zed: post a udev change event from spa_vdev_attach()

In order for zed to process the removal event correctly,
a udev change event needs to be posted to sync the blkid
information. spa_create() and spa_config_update() already
post the event through spa_write_cachefile(). Do the same
for spa_vdev_attach(), which handles vdev attachment and
replacement.
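
A conceptual sketch of where this lands (the helper name below is a
hypothetical stand-in; the commit only says the event goes through the
same path spa_write_cachefile() already uses):

    /*
     * At the end of spa_vdev_attach(), once the attach or replace has
     * been committed, post a udev change event so blkid information is
     * re-read and zed sees a consistent view of the new vdev.
     */
    spa_post_udev_change_event(spa);    /* hypothetical helper */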

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Signed-off-by: Ameer Hamza <ahamza@ixsystems.com>
Closes #14172

Pull-request: #14667 part 2/3
Ameer Hamza
zed: mark disks as REMOVED when they are removed

ZED does not take any action for disk removal events if there is no
spare VDEV available. Added zpool_vdev_remove_wanted() in libzfs
and vdev_remove_wanted() in vdev.c to remove the VDEV through ZED
on a removal event.  This means that if you are running zed and
remove a disk, it will be properly marked as REMOVED.
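
A minimal sketch of how a zed removal handler might use the new libzfs
call; the zpool_vdev_remove_wanted() signature is assumed here to mirror
zpool_vdev_remove() and take the vdev path or GUID string:

    #include <libzfs.h>

    /* Illustrative only; not the actual zed agent code. */
    static void
    on_disk_removed(libzfs_handle_t *hdl, const char *pool, const char *vdev)
    {
        zpool_handle_t *zhp = zpool_open_canfail(hdl, pool);
        if (zhp == NULL)
            return;
        /* Ask the kernel to mark this vdev as REMOVED. */
        (void) zpool_vdev_remove_wanted(zhp, vdev);
        zpool_close(zhp);
    }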

Signed-off-by: Ameer Hamza <ahamza@ixsystems.com>

Pull-request: #14667 part 1/3
Rob Norris
man: add ZIO_STAGE_BRT_FREE to zpool-events

And bump all the values after it, matching the header update in
67a1b037.

Signed-off-by: Rob Norris <robn@despairlabs.com>

Pull-request: #14665 part 1/1
Alexander Motin
Improve arc_read() error reporting

While debugging a reported NULL dereference panic in dnode_hold_impl(),
I found that for certain types of errors arc_read() may only return an
error code, but not properly report it via the done and pio arguments.
Lack of done calls may result in reference and/or memory leaks in
higher-level code.  Lack of error reporting via pio may result in
unnoticed errors there.  For example, dbuf_read(), where dbuf_read_impl()
ignores the arc_read() return value, relies completely on the pio
mechanism and misses the errors.

This patch makes arc_read() always call the done callback and always
propagate errors to the parent zio, if either is provided.
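
The pattern described above, sketched (this is illustrative, not the
actual arc_read() diff; whether the patch really propagates the error
through an errored zio_null() child is an assumption):

    /* Early-error exit path in arc_read(), sketched. */
    if (rc != 0) {
        if (done != NULL)
            done(NULL, zb, bp, NULL, private);  /* report the failure */
        if (pio != NULL) {
            zio_t *czio = zio_null(pio, spa, NULL, NULL, NULL,
                zio_flags);
            czio->io_error = rc;                /* assumed mechanism */
            zio_nowait(czio);
        }
        return (rc);
    }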

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Brian Atkinson <batkinson@lanl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.

Pull-request: #14663 part 1/1
Herbert Wartens
Allow MMP to bypass waiting for other threads

At our site we have seen cases when multi-modifier protection is enabled
(multihost=on) on our pool and the pool gets suspended due to a single
disk that is failing and responding very slowly. Our pools have 90 disks
in them and we expect disks to fail. The current version of MMP requires
that we wait for other writers before moving on. When a disk is
responding very slowly, we observed that waiting here was bad enough to
cause the pool to suspend. This change allows the MMP thread to bypass
waiting for other threads and reduces the chances the pool gets
suspended.

Signed-off-by: Herb Wartens <hawartens@gmail.com>

Pull-request: #14659 part 1/1
GitHub
Merge branch 'openzfs:master' into mmp-bypass-wait

Pull-request: #14657 part 2/2
Herbert Wartens
Allow MMP to bypass waiting for other threads

At our site we have seen cases when multi-modifier protection is enabled
(multihost=on) on our pool and the pool gets suspended due to a single
disk that is failing and responding very slowly. Our pools have 90 disks
in them and we expect disks to fail. The current version of MMP requires
that we wait for other writers before moving on. When a disk is
responding very slowly, we observed that waiting here was bad enough to
cause the pool to suspend. This change allows the MMP thread to bypass
waiting for other threads and reduces the chances the pool gets
suspended.

Signed-off-by: Herb Wartens <hawartens@gmail.com>

Pull-request: #14657 part 1/1
Pawel Jakub Dawidek
Fix build on FreeBSD.

Constify some variables after d1807f168edd09ca26a5a0c6b570686b982808ad.

Signed-off-by: Pawel Jakub Dawidek <pawel@dawidek.net>

Pull-request: #14656 part 1/1
Pawel Jakub Dawidek
Protect db_buf with db_mtx.

It is safe to call arc_buf_destroy() with db_mtx held.
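
The locking pattern being asserted, as a fragment (assuming the usual
dbuf context; not the full change):

    mutex_enter(&db->db_mtx);
    if (db->db_buf != NULL) {
        arc_buf_destroy(db->db_buf, db);    /* safe under db_mtx */
        db->db_buf = NULL;
    }
    mutex_exit(&db->db_mtx);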

Pointed out by: amotin
Signed-off-by: Pawel Jakub Dawidek <pawel@dawidek.net>

Pull-request: #14655 part 2/2
Pawel Jakub Dawidek
Fix cloning into already dirty dbufs.

Undirty the dbuf and destroy its buffer when cloning into it.
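
Roughly the shape of the fix (an assumed sketch only; the real change
lives in the dbuf clone path and the dirty-record/txg handling is
simplified here):

    /* Cloning into a dbuf that is already dirty: drop the dirty state
     * and its buffer first, so the clone does not keep stale data. */
    mutex_enter(&db->db_mtx);
    dbuf_dirty_record_t *dr = dbuf_find_dirty_eq(db, tx->tx_txg);
    if (dr != NULL)
        (void) dbuf_undirty(db, dr);        /* undirty */
    if (db->db_buf != NULL) {
        arc_buf_destroy(db->db_buf, db);    /* destroy its buffer */
        db->db_buf = NULL;
    }
    mutex_exit(&db->db_mtx);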

Reported by: Richard Yao
Reported by: Benjamin Coddington
Coverity ID: CID-1535375
Signed-off-by: Pawel Jakub Dawidek <pawel@dawidek.net>

Pull-request: #14655 part 1/2
Pawel Jakub Dawidek
Retry if we find dbuf dirty again.

Signed-off-by: Pawel Jakub Dawidek <pawel@dawidek.net>

Pull-request: #14655 part 3/3
Pawel Jakub Dawidek
Protect db_buf with db_mtx.

It is safe to call arc_buf_destroy() with db_mtx held.

Pointed out by: amotin
Signed-off-by: Pawel Jakub Dawidek <pawel@dawidek.net>

Pull-request: #14655 part 2/3
Pawel Jakub Dawidek
Fix cloning into already dirty dbufs.

Undirty the dbuf and destroy its buffer when cloning into it.

Reported by: Richard Yao
Reported by: Benjamin Coddington
Coverity ID: CID-1535375
Signed-off-by: Pawel Jakub Dawidek <pawel@dawidek.net>

Pull-request: #14655 part 1/3
Pawel Jakub Dawidek
Retry if we find dbuf dirty again.

Pull-request: #14655 part 3/3
Ameer Hamza
Update vdev state for spare vdev

zfsd fetches the new pool configuration through ZFS_IOC_POOL_STATS, but
it does not get the updated nvlist configuration for a spare vdev since
that configuration is read from spa_spares->sav_config. In this commit,
update the vdev state for the spare vdev, which is consumed by zfsd on
spare disk hotplug.

Signed-off-by: Ameer Hamza <ahamza@ixsystems.com>

Pull-request: #14653 part 1/1
Rob Norris
zdb: add -B option to generate backup stream

This is more-or-less like `zfs send`, but specifying the snapshot by its
objset id for situations where it can't be referenced any other way.

Sponsored-By: Klara, Inc.
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>

Pull-request: #14642 part 1/1
George Amanakis
Fixes in persistent error log

Address the following bugs in persistent error log:
1) Check nested clones, e.g. "fs->snap->clone->snap2->clone2".

2) When deleting files containing error blocks in those clones (from
  "clone" in the example above), do not break the check chain.

3) When deleting files in the originating fs before syncing the errlog
  to disk, do not break the check chain. This happens because at the
  time of introducing the error block in the error list, we do not have
  its birth txg and the head filesystem. If the original file is
  deleted before the error list is synced to the error log (which is
  when we actually look up the birth txg and the head filesystem), then
  we no longer have access to this info and the check chain breaks.

The most prominent change is related to achieving (3). We expand the
spa_error_entry_t structure to accommodate the newly introduced
zbookmark_err_phys_t structure (containing the birth txg of the error
block). For compatibility reasons we cannot remove the
zbookmark_phys_t structure, and we also need to place the new structure
after se_avl, so it is not accounted for in avl_find(). Then we modify
spa_log_error() to also provide the birth txg of the error block. With
these changes in place we simplify the previously introduced function
get_head_and_birth_txg() (now named get_head_ds()).
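
The resulting layout is roughly the following (the existing field names
come from spa_impl.h; the name of the new member is an assumption):

    typedef struct spa_error_entry {
        zbookmark_phys_t      se_bookmark;
        char                  *se_name;
        avl_node_t            se_avl;
        zbookmark_err_phys_t  se_zep;   /* new; placed after se_avl so
                                         * it is ignored by avl_find() */
    } spa_error_entry_t;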

We chose not to follow the same approach for the head filesystem (thus
completely removing get_head_ds()) to avoid introducing new lock
contentions.

The stack sizes of nested functions (as measured by checkstack.pl in the
Linux kernel) are:
check_filesystem [zfs]: 272 (was 912)
check_clones [zfs]: 64

We also introduced two new tests covering the above changes.

Signed-off-by: George Amanakis <gamanakis@gmail.com>

Pull-request: #14633 part 1/1
Timothy Day
Fix kmodtool for packaging mainline Linux

kmodtool currently incorrectly identifies official
RHEL kernels, as opposed to custom kernels. This
can cause the OpenZFS kmod RPM build to break.

The issue can be reproduced by building a set of
mainline Linux RPMs, installing them, and then
attempting to build the openZFS kmod package
against them.

Signed-off-by: Timothy Day <timday@amazon.com>

Pull-request: #14617 part 1/1
GitHub
Merge branch 'openzfs:master' into master

Pull-request: #14617 part 2/2
Timothy Day
Fix kmodtool for building against mainline Linux

kmodtool currently incorrectly identifies official
RHEL kernels, as opposed to custom kernels. This
can cause the OpenZFS kmod RPM build to break.

The issue can be reproduced by building a set of
mainline Linux RPMs, installing them, and then
attempting to build the openZFS kmod package
against them.

Signed-off-by: Timothy Day <timday@amazon.com>

Pull-request: #14617 part 1/2
Paul Dagnelie
tweak warning semantics

Signed-off-by: Paul Dagnelie <pcd@delphix.com>

Pull-request: #14145 part 2/2
Paul Dagnelie
DLPX-82702 Storage device expansion "silently" fails on degraded vdev

Signed-off-by: Paul Dagnelie <pcd@delphix.com>

Pull-request: #14145 part 1/2
Jason Lee
ZFS Interface for Accelerators (Z.I.A.)

The ZIO pipeline has been modified to allow for external, alternative
implementations of existing operations to be used. The original ZFS
functions remain in the code as fallback in case the external
implementation fails.

Definitions:
    Accelerator - an entity (usually hardware) that is intended to
                  accelerate operations
    Offloader  - synonym of accelerator; used interchangeably
    Data Processing Unit Services Module (DPUSM)
                - https://github.com/hpc/dpusm
                - defines a "provider API" for accelerator
                  vendors to set up
                - defines a "user API" for accelerator consumers
                  to call
                - maintains list of providers and coordinates
                  interactions between providers and consumers.
    Provider    - a DPUSM wrapper for an accelerator's API
    Offload    - moving data from ZFS/memory to the accelerator
    Onload      - the opposite of offload

In order for Z.I.A. to be extensible, it does not directly communicate
with a fixed accelerator. Rather, Z.I.A. acquires a handle to a DPUSM,
which is then used to acquire handles to providers.
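
Conceptually (every identifier below is hypothetical, since the user API
is defined by the DPUSM project rather than by this change):

    /* Acquire the DPUSM, then ask it for a handle to a named provider. */
    const dpusm_uf_t *dpusm = dpusm_get();
    void *provider = dpusm->get("provider name");
    /* ... offload/onload through the provider ... */
    dpusm->put(provider);
    dpusm_put(dpusm);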

Using ZFS with Z.I.A.:
    1. Build and start the DPUSM
    2. Implement, build, and register a provider with the DPUSM
    3. Reconfigure ZFS with '--with-zia=<DPUSM root>'
    4. Rebuild and start ZFS
    5. Create a zpool
    6. Select the provider
          zpool set zia_provider=<provider name> <zpool>
    7. Select operations to offload
          zpool set zia_<property>=on <zpool>

The operations that have been modified are:
    - compression
        - non-raw-writes only
    - decompression
    - checksum
        - not handling embedded checksums
        - checksum compute and checksum error call the same function
    - raidz
        - generation
        - reconstruction
    - vdev_file
        - open
        - write
        - close
    - vdev_disk
        - open
        - invalidate
        - write
        - flush
        - close

Successful operations do not bring data back into memory after they
complete, allowing subsequent offloader operations to reuse the
data. This results in only one data movement per ZIO, at the beginning
of the pipeline, which is necessary for getting data from ZFS to the
accelerator.

When errors occur and the offloaded data is still accessible, the
offloaded data will be onloaded (or dropped if it still matches the
in-memory copy) for that ZIO pipeline stage and processed with
ZFS. This will cause thrashing if a later operation offloads
data. This should not happen often, as constant errors (resulting in
data movement) are not expected to be the norm.

Unrecoverable errors such as hardware failures will trigger pipeline
restarts (if necessary) in order to complete the original ZIO using
the software path.

The modifications to ZFS can be thought of as two sets of changes:
    - The ZIO write pipeline
        - compression, checksum, RAIDZ generation, and write
        - Each stage starts by offloading data that was not previously
          offloaded
            - This allows for ZIOs to be offloaded at any point in
              the pipeline
    - Resilver
        - vdev_raidz_io_done (RAIDZ reconstruction, checksum, and
          RAIDZ generation), and write
        - Because the core of resilver is vdev_raidz_io_done, data is
          only offloaded once at the beginning of vdev_raidz_io_done
            - Errors cause data to be onloaded, but will not
              re-offload in subsequent steps within resilver
            - Write is a separate ZIO pipeline stage, so it will
              attempt to offload data

The zio_decompress function has been modified to allow for offloading
but the ZIO read pipeline as a whole has not, so it is not part of the
above list.

An example provider implementation can be found in module/zia-software-provider
    - The provider's "hardware" is actually software - data is
      "offloaded" to memory not owned by ZFS
    - Calls ZFS functions in order to not reimplement operations
    - Has kernel module parameters that can be used to trigger
      ZIA_ACCELERATOR_DOWN states for testing pipeline restarts.

abd_t, raidz_row_t, and vdev_t have each been given an additional
"void *<prefix>_zia_handle" member. These opaque handles point to data
that is located on an offloader. abds are still allocated, but their
contents are expected to diverge from the offloaded copy as operations
are run.
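
For example, the abd_t addition would look roughly like this (the field
name follows the "<prefix>_zia_handle" convention described above):

    typedef struct abd {
        /* ... existing abd fields ... */
        void    *abd_zia_handle;    /* opaque reference to the
                                     * offloaded copy of the data */
    } abd_t;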

Encryption and deduplication are disabled for zpools with
Z.I.A. operations enabled

Aggregation is disabled for offloaded abds

RPMs will build with Z.I.A.

Signed-off-by: Jason Lee <jasonlee@lanl.gov>

Pull-request: #13628 part 1/1
Jason Lee
ZFS Interface for Accelerators (Z.I.A.)

The ZIO pipeline has been modified to allow for external, alternative
implementations of existing operations to be used. The original ZFS
functions remain in the code as fallback in case the external
implementation fails.

Definitions:
    Accelerator - an entity (usually hardware) that is intended to
                  accelerate operations
    Offloader  - synonym of accelerator; used interchangeably
    Data Processing Unit Services Module (DPUSM)
                - https://github.com/hpc/dpusm
                - defines a "provider API" for accelerator
                  vendors to set up
                - defines a "user API" for accelerator consumers
                  to call
                - maintains list of providers and coordinates
                  interactions between providers and consumers.
    Provider    - a DPUSM wrapper for an accelerator's API
    Offload    - moving data from ZFS/memory to the accelerator
    Onload      - the opposite of offload

In order for Z.I.A. to be extensible, it does not directly communicate
with a fixed accelerator. Rather, Z.I.A. acquires a handle to a DPUSM,
which is then used to acquire handles to providers.

Using ZFS with Z.I.A.:
    1. Build and start the DPUSM
    2. Implement, build, and register a provider with the DPUSM
    3. Reconfigure ZFS with '--with-zia=<DPUSM root>'
    4. Rebuild and start ZFS
    5. Create a zpool
    6. Select the provider
          zpool set zia_provider=<provider name> <zpool>
    7. Select operations to offload
          zpool set zia_<property>=on <zpool>

The operations that have been modified are:
    - compression
        - non-raw-writes only
    - decompression
    - checksum
        - not handling embedded checksums
        - checksum compute and checksum error call the same function
    - raidz
        - generation
        - reconstruction
    - vdev_file
        - open
        - write
        - close
    - vdev_disk
        - open
        - invalidate
        - write
        - flush
        - close

Successful operations do not bring data back into memory after they
complete, allowing subsequent offloader operations to reuse the
data. This results in only one data movement per ZIO, at the beginning
of the pipeline, which is necessary for getting data from ZFS to the
accelerator.

When errors occur and the offloaded data is still accessible, the
offloaded data will be onloaded (or dropped if it still matches the
in-memory copy) for that ZIO pipeline stage and processed with
ZFS. This will cause thrashing if a later operation offloads
data. This should not happen often, as constant errors (resulting in
data movement) are not expected to be the norm.

Unrecoverable errors such as hardware failures will trigger pipeline
restarts (if necessary) in order to complete the original ZIO using
the software path.

The modifications to ZFS can be thought of as changes to two pipelines:
    - The ZIO write pipeline
        - compression, checksum, RAIDZ generation, and write
        - Each stage starts by offloading data that was not previously
          offloaded
            - This allows for ZIOs to be offloaded at any point in
              the pipeline
    - Resilver
        - vdev_raidz_io_done (RAIDZ reconstruction, checksum, and
          RAIDZ generation), and write
        - Because the core of resilver is vdev_raidz_io_done, data is
          only offloaded once at the beginning of vdev_raidz_io_done
            - Errors cause data to be onloaded, but will not
              re-offload in subsequent steps within resilver
            - Write is a separate ZIO pipeline stage, so it will
              attempt to offload data

The zio_decompress function has been modified to allow for offloading
but the ZIO read pipeline as a whole has not, so it is not part of the
above list.

An example provider implementation can be found in module/zia-software-provider
    - The provider's "hardware" is actually software - data is
      "offloaded" to memory not owned by ZFS
    - Calls ZFS functions in order to not reimplement operations
    - Has kernel module parameters that can be used to trigger
      ZIA_ACCELERATOR_DOWN states for testing pipeline restarts.

abd_t, raidz_row_t, and vdev_t have each been given an additional
"void *<prefix>_zia_handle" member. These opaque handles point to data
that is located on an offloader. abds are still allocated, but their
contents are expected to diverge from the offloaded copy as operations
are run.

ARC compression is disabled when Z.I.A. is configured

Encryption and deduplication are disabled for zpools with
Z.I.A. operations enabled

Aggregation is disabled for offloaded abds

RPMs will build with Z.I.A.

Signed-off-by: Jason Lee <jasonlee@lanl.gov>

Pull-request: #13628 part 1/1