Console View

Rich Ercolani <rincebrain@gmail.com>

Add another bdev_io_acct case for Linux 6.3+ compat

Linux 6.3+, and backports from it, changed the signatures on
bdev_io_{start,end}_acct. Add a case for it.

Fixes: #14658
Signed-off-by: Rich Ercolani <rincebrain@gmail.com>
Pull-request: #14668 part 1/1
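
As a rough illustration of this kind of compat case (not the PR's actual
diff), a shim selecting between the two signatures might look like the
sketch below. HAVE_BDEV_IO_ACCT_63 and the wrapper names are hypothetical
stand-ins for whatever configure check and helpers the build uses, and the
exact prototypes should be verified against the kernel's linux/blkdev.h.

    #include <linux/blkdev.h>
    #include <linux/bio.h>
    #include <linux/jiffies.h>

    static inline unsigned long
    vdev_bio_start_io_acct(struct bio *bio)
    {
    #ifdef HAVE_BDEV_IO_ACCT_63
        /* Linux 6.3+: no sector count when the I/O starts. */
        return (bdev_start_io_acct(bio->bi_bdev, bio_op(bio), jiffies));
    #else
        /* Pre-6.3: sector count is passed at start. */
        return (bdev_start_io_acct(bio->bi_bdev, bio_sectors(bio),
            bio_op(bio), jiffies));
    #endif
    }

    static inline void
    vdev_bio_end_io_acct(struct bio *bio, unsigned long start_time)
    {
    #ifdef HAVE_BDEV_IO_ACCT_63
        /* Linux 6.3+: sector count is reported at completion instead. */
        bdev_end_io_acct(bio->bi_bdev, bio_op(bio), bio_sectors(bio),
            start_time);
    #else
        bdev_end_io_acct(bio->bi_bdev, bio_op(bio), start_time);
    #endif
    }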

Ameer Hamza <ahamza@ixsystems.com>

zed: add hotplug support for spare vdevs

This commit adds support for spare vdev hotplug. The spare vdev associated
with all the pools will be marked as "Removed" when the drive is physically
detached and will become "Available" when the drive is reattached.
Currently, the spare vdev status does not change on drive removal, and the
same is the case with reattachment.

Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ameer Hamza <ahamza@ixsystems.com>
Closes #14295
Pull-request: #14667 part 3/3

Ameer Hamza <ahamza@ixsystems.com>

zed: post a udev change event from spa_vdev_attach()

In order for zed to process the removal event correctly, a udev change
event needs to be posted to sync the blkid information. spa_create() and
spa_config_update() already post the event through spa_write_cachefile().
Do the same for spa_vdev_attach(), which handles vdev attachment and
replacement.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Signed-off-by: Ameer Hamza <ahamza@ixsystems.com>
Closes #14172
Pull-request: #14667 part 2/3

Ameer Hamza <ahamza@ixsystems.com>

zed: mark disks as REMOVED when they are removed

ZED does not take any action for disk removal events if there is no spare
VDEV available. Added zpool_vdev_remove_wanted() in libzfs and
vdev_remove_wanted() in vdev.c to remove the VDEV through ZED on a removal
event. This means that if you are running zed and remove a disk, it will be
properly marked as REMOVED.

Signed-off-by: Ameer Hamza <ahamza@ixsystems.com>
Pull-request: #14667 part 1/3
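
A hedged sketch of how a zed-style consumer might drive the new call; the
zpool_vdev_remove_wanted() prototype is assumed here to mirror
zpool_vdev_remove(), and the surrounding helper is hypothetical.

    #include <libzfs.h>

    /*
     * Hypothetical helper: ask the kernel to mark a physically removed
     * device as REMOVED.  Assumes zpool_vdev_remove_wanted() takes a pool
     * handle and a vdev path; check the libzfs.h shipped with this PR for
     * the real prototype.
     */
    static int
    mark_vdev_removed(libzfs_handle_t *hdl, const char *pool,
        const char *vdev_path)
    {
        zpool_handle_t *zhp = zpool_open_canfail(hdl, pool);
        if (zhp == NULL)
            return (-1);

        int err = zpool_vdev_remove_wanted(zhp, vdev_path);

        zpool_close(zhp);
        return (err);
    }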

Rob Norris <robn@despairlabs.com>

man: add ZIO_STAGE_BRT_FREE to zpool-events

And bump all the values after it, matching the header update in 67a1b037.

Signed-off-by: Rob Norris <robn@despairlabs.com>
Pull-request: #14665 part 1/1

Alexander Motin <mav@FreeBSD.org>

Improve arc_read() error reporting

Debugging a reported NULL dereference panic in dnode_hold_impl(), I found
that for certain types of errors arc_read() may only return an error code,
but not properly report it via the done and pio arguments. Lack of done
calls may result in reference and/or memory leaks in higher-level code.
Lack of error reporting via pio may result in unnoticed errors there. For
example, dbuf_read(), where dbuf_read_impl() ignores the arc_read() return
value, relies completely on the pio mechanism and so missed the errors.
This patch makes arc_read() always call the done callback and always
propagate errors to the parent zio, if either is provided.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Brian Atkinson <batkinson@lanl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Pull-request: #14663 part 1/1
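
A self-contained illustration of the contract being enforced, using
stand-in types rather than the real arc/zio ones: every early-error exit
still fires the completion callback and records the error on the parent
I/O, instead of only returning an error code.

    #include <stdio.h>
    #include <stddef.h>

    /* Stand-ins for the zio/arc types involved. */
    typedef struct parent_io { int io_error; } parent_io_t;
    typedef void read_done_func_t(void *buf, void *priv, int error);

    static void
    record_child_error(parent_io_t *pio, int error)
    {
        /* First error wins, mirroring "propagate errors to parent zio". */
        if (pio->io_error == 0)
            pio->io_error = error;
    }

    /*
     * Even when the read fails before any I/O is issued, the caller's
     * callback runs (so it can release its references) and the parent
     * records the error.
     */
    static int
    read_with_callback(parent_io_t *pio, read_done_func_t *done, void *priv,
        int early_error)
    {
        if (early_error != 0) {
            if (done != NULL)
                done(NULL, priv, early_error);
            if (pio != NULL)
                record_child_error(pio, early_error);
            return (early_error);
        }
        /* ...otherwise issue the asynchronous read here... */
        return (0);
    }

    static void
    my_done(void *buf, void *priv, int error)
    {
        (void) buf; (void) priv;
        printf("done callback ran, error = %d\n", error);
    }

    int
    main(void)
    {
        parent_io_t pio = { 0 };
        int err = read_with_callback(&pio, my_done, NULL, 5);
        printf("returned %d, parent io_error = %d\n", err, pio.io_error);
        return (0);
    }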

Herbert Wartens <wartens2@llnl.gov>

Allow MMP to bypass waiting for other threads

At our site we have seen cases where multi-modifier protection is enabled
(multihost=on) on our pool and the pool gets suspended due to a single
disk that is failing and responding very slowly. Our pools have 90 disks
in them and we expect disks to fail. The current version of MMP requires
that we wait for other writers before moving on. When a disk is responding
very slowly, we observed that waiting here was bad enough to cause the
pool to suspend. This change allows the MMP thread to bypass waiting for
other threads and reduces the chances the pool gets suspended.

Signed-off-by: Herb Wartens <hawartens@gmail.com>
Pull-request: #14659 part 1/1

GitHub <noreply@github.com>

Merge branch 'openzfs:master' into mmp-bypass-wait

Pull-request: #14657 part 2/2

Herbert Wartens <wartens2@llnl.gov>

Allow MMP to bypass waiting for other threads

At our site we have seen cases where multi-modifier protection is enabled
(multihost=on) on our pool and the pool gets suspended due to a single
disk that is failing and responding very slowly. Our pools have 90 disks
in them and we expect disks to fail. The current version of MMP requires
that we wait for other writers before moving on. When a disk is responding
very slowly, we observed that waiting here was bad enough to cause the
pool to suspend. This change allows the MMP thread to bypass waiting for
other threads and reduces the chances the pool gets suspended.

Signed-off-by: Herb Wartens <hawartens@gmail.com>
Pull-request: #14657 part 1/1

Pawel Jakub Dawidek <pawel@dawidek.net>

Fix build on FreeBSD.

Constify some variables after d1807f168edd09ca26a5a0c6b570686b982808ad.

Signed-off-by: Pawel Jakub Dawidek <pawel@dawidek.net>
Pull-request: #14656 part 1/1

Pawel Jakub Dawidek <pawel@dawidek.net>

Protect db_buf with db_mtx.

It is safe to call arc_buf_destroy() with db_mtx held.

Pointed out by: amotin
Signed-off-by: Pawel Jakub Dawidek <pawel@dawidek.net>
Pull-request: #14655 part 2/2

Pawel Jakub Dawidek <pawel@dawidek.net>

Fix cloning into already dirty dbufs.

Undirty the dbuf and destroy its buffer when cloning into it.

Reported by: Richard Yao
Reported by: Benjamin Coddington
Coverity ID: CID-1535375
Signed-off-by: Pawel Jakub Dawidek <pawel@dawidek.net>
Pull-request: #14655 part 1/2

Pawel Jakub Dawidek <pawel@dawidek.net>

Retry if we find dbuf dirty again.

Signed-off-by: Pawel Jakub Dawidek <pawel@dawidek.net>
Pull-request: #14655 part 3/3

Pawel Jakub Dawidek <pawel@dawidek.net>

Protect db_buf with db_mtx.

It is safe to call arc_buf_destroy() with db_mtx held.

Pointed out by: amotin
Signed-off-by: Pawel Jakub Dawidek <pawel@dawidek.net>
Pull-request: #14655 part 2/3

Pawel Jakub Dawidek <pawel@dawidek.net>

Fix cloning into already dirty dbufs.

Undirty the dbuf and destroy its buffer when cloning into it.

Reported by: Richard Yao
Reported by: Benjamin Coddington
Coverity ID: CID-1535375
Signed-off-by: Pawel Jakub Dawidek <pawel@dawidek.net>
Pull-request: #14655 part 1/3

Pawel Jakub Dawidek <pawel@dawidek.net>

Retry if we find dbuf dirty again.

Pull-request: #14655 part 3/3

Ameer Hamza <ahamza@ixsystems.com>

Update vdev state for spare vdev

zfsd fetches the new pool configuration through ZFS_IOC_POOL_STATS, but it
does not get the updated nvlist configuration for the spare vdev, since
that configuration is read from spa_spares->sav_config. This commit
updates the vdev state for the spare vdev, which is consumed by zfsd on
spare disk hotplug.

Signed-off-by: Ameer Hamza <ahamza@ixsystems.com>
Pull-request: #14653 part 1/1

Rob Norris <rob.norris@klarasystems.com>

zdb: add -B option to generate backup stream

This is more-or-less like `zfs send`, but specifying the snapshot by its
objset id for situations where it can't be referenced any other way.

Sponsored-By: Klara, Inc.
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Pull-request: #14642 part 1/1

George Amanakis <gamanakis@gmail.com>

Fixes in persistent error log

Address the following bugs in the persistent error log:

1) Check nested clones, e.g. "fs->snap->clone->snap2->clone2".

2) When deleting files containing error blocks in those clones (from
   "clone" in the example above), do not break the check chain.

3) When deleting files in the originating fs before syncing the errlog to
   disk, do not break the check chain. This happens because at the time of
   introducing the error block in the error list, we do not have its birth
   txg and the head filesystem. If the original file is deleted before the
   error list is synced to the error log (which is when we actually look
   up the birth txg and the head filesystem), then we no longer have
   access to this info and break the check chain.

The most prominent change is related to achieving (3). We expand the
spa_error_entry_t structure to accommodate the newly introduced
zbookmark_err_phys_t structure (containing the birth txg of the error
block). Due to compatibility reasons we cannot remove the zbookmark_phys_t
structure, and we also need to place the new structure after se_avl, so it
is not accounted for in avl_find(). Then we modify spa_log_error() to also
provide the birth txg of the error block. With these changes in place we
simplify the previously introduced function get_head_and_birth_txg() (now
named get_head_ds()).

We chose not to follow the same approach for the head filesystem (thus
completely removing get_head_ds()) to avoid introducing new lock
contentions.

The stack sizes of nested functions (as measured by checkstack.pl in the
Linux kernel) are:
check_filesystem [zfs]: 272 (was 912)
check_clones [zfs]: 64

We also introduced two new tests covering the above changes.

Signed-off-by: George Amanakis <gamanakis@gmail.com>
Pull-request: #14633 part 1/1
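
A hedged sketch of the expanded in-memory entry described above; the
stand-in typedefs exist only so the sketch compiles on its own, and the
real definitions live in the OpenZFS headers.

    #include <stdint.h>

    /* Simplified stand-ins for the real ZFS types. */
    typedef struct {
        uint64_t zb_objset, zb_object;
        int64_t  zb_level;
        uint64_t zb_blkid;
    } zbookmark_phys_t;
    typedef struct {
        uint64_t zb_object;
        int64_t  zb_level;
        uint64_t zb_blkid, zb_birth;
    } zbookmark_err_phys_t;
    typedef struct { void *avl_opaque[3]; } avl_node_t;

    /*
     * Layout described in the commit message: the new se_zep member
     * (carrying the error block's birth txg) is appended after se_avl so
     * the existing bookmark comparison used by avl_find() is unaffected,
     * while zbookmark_phys_t stays for compatibility.
     */
    typedef struct spa_error_entry {
        zbookmark_phys_t     se_bookmark;
        char                 *se_name;
        avl_node_t           se_avl;
        zbookmark_err_phys_t se_zep;     /* new in this change */
    } spa_error_entry_t;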

Timothy Day <timday@amazon.com>

Fix kmodtool for packaging mainline Linux

kmodtool currently incorrectly identifies official RHEL kernels, as
opposed to custom kernels. This can cause the OpenZFS kmod RPM build to
break. The issue can be reproduced by building a set of mainline Linux
RPMs, installing them, and then attempting to build the OpenZFS kmod
package against them.

Signed-off-by: Timothy Day <timday@amazon.com>
Pull-request: #14617 part 1/1

GitHub <noreply@github.com>

Merge branch 'openzfs:master' into master

Pull-request: #14617 part 2/2

Timothy Day <timday@amazon.com>

Fix kmodtool for building against mainline Linux

kmodtool currently incorrectly identifies official RHEL kernels, as
opposed to custom kernels. This can cause the OpenZFS kmod RPM build to
break. The issue can be reproduced by building a set of mainline Linux
RPMs, installing them, and then attempting to build the OpenZFS kmod
package against them.

Signed-off-by: Timothy Day <timday@amazon.com>
Pull-request: #14617 part 1/2

Paul Dagnelie <pcd@delphix.com>

tweak warning semantics

Signed-off-by: Paul Dagnelie <pcd@delphix.com>
Pull-request: #14145 part 2/2

Paul Dagnelie <pcd@delphix.com>

DLPX-82702 Storage device expansion "silently" fails on degraded vdev

Signed-off-by: Paul Dagnelie <pcd@delphix.com>
Pull-request: #14145 part 1/2

Jason Lee <jasonlee@lanl.gov>

ZFS Interface for Accelerators (Z.I.A.)

The ZIO pipeline has been modified to allow for external, alternative
implementations of existing operations to be used. The original ZFS
functions remain in the code as fallback in case the external
implementation fails.

Definitions:
  Accelerator - an entity (usually hardware) that is intended to
      accelerate operations
  Offloader - synonym of accelerator; used interchangeably
  Data Processing Unit Services Module (DPUSM)
      - https://github.com/hpc/dpusm
      - defines a "provider API" for accelerator vendors to set up
      - defines a "user API" for accelerator consumers to call
      - maintains a list of providers and coordinates interactions
        between providers and consumers
  Provider - a DPUSM wrapper for an accelerator's API
  Offload - moving data from ZFS/memory to the accelerator
  Onload - the opposite of offload

In order for Z.I.A. to be extensible, it does not directly communicate
with a fixed accelerator. Rather, Z.I.A. acquires a handle to a DPUSM,
which is then used to acquire handles to providers.

Using ZFS with Z.I.A.:
  1. Build and start the DPUSM
  2. Implement, build, and register a provider with the DPUSM
  3. Reconfigure ZFS with '--with-zia=<DPUSM root>'
  4. Rebuild and start ZFS
  5. Create a zpool
  6. Select the provider
         zpool set zia_provider=<provider name> <zpool>
  7. Select operations to offload
         zpool set zia_<property>=on <zpool>

The operations that have been modified are:
  - compression
      - non-raw-writes only
  - decompression
  - checksum
      - not handling embedded checksums
      - checksum compute and checksum error call the same function
  - raidz
      - generation
      - reconstruction
  - vdev_file
      - open
      - write
      - close
  - vdev_disk
      - open
      - invalidate
      - write
      - flush
      - close

Successful operations do not bring data back into memory after they
complete, allowing subsequent offloader operations to reuse the data. This
results in only one data movement per ZIO, at the beginning of a pipeline,
to get data from ZFS to the accelerator.

When errors occur and the offloaded data is still accessible, the
offloaded data will be onloaded (or dropped if it still matches the
in-memory copy) for that ZIO pipeline stage and processed with ZFS. This
will cause thrashing if a later operation offloads data. This should not
happen often, as constant errors (resulting in data movement) are not
expected to be the norm. Unrecoverable errors such as hardware failures
will trigger pipeline restarts (if necessary) in order to complete the
original ZIO using the software path.

The modifications to ZFS can be thought of as two sets of changes:
  - The ZIO write pipeline
      - compression, checksum, RAIDZ generation, and write
      - Each stage starts by offloading data that was not previously
        offloaded
          - This allows for ZIOs to be offloaded at any point in the
            pipeline
  - Resilver
      - vdev_raidz_io_done (RAIDZ reconstruction, checksum, and RAIDZ
        generation), and write
      - Because the core of resilver is vdev_raidz_io_done, data is only
        offloaded once at the beginning of vdev_raidz_io_done
          - Errors cause data to be onloaded, but will not re-offload in
            subsequent steps within resilver
          - Write is a separate ZIO pipeline stage, so it will attempt to
            offload data

The zio_decompress function has been modified to allow for offloading, but
the ZIO read pipeline as a whole has not, so it is not part of the above
list.

An example provider implementation can be found in
module/zia-software-provider:
  - The provider's "hardware" is actually software - data is "offloaded"
    to memory not owned by ZFS
  - Calls ZFS functions in order to not reimplement operations
  - Has kernel module parameters that can be used to trigger
    ZIA_ACCELERATOR_DOWN states for testing pipeline restarts

abd_t, raidz_row_t, and vdev_t have each been given an additional
"void *<prefix>_zia_handle" member. These opaque handles point to data
that is located on an offloader. abds are still allocated, but their
contents are expected to diverge from the offloaded copy as operations are
run.

Encryption and deduplication are disabled for zpools with Z.I.A.
operations enabled.

Aggregation is disabled for offloaded abds.

RPMs will build with Z.I.A.

Signed-off-by: Jason Lee <jasonlee@lanl.gov>
Pull-request: #13628 part 1/1
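
A self-contained sketch of the opaque-handle-with-software-fallback
pattern described above; all names are illustrative (the real members are
the <prefix>_zia_handle fields, and real provider calls go through the
DPUSM), so this is an assumption-laden outline, not the PR's code.

    #include <stddef.h>
    #include <stdint.h>
    #include <stdio.h>

    /* Buffer with both an in-memory copy and an opaque offloaded handle. */
    typedef struct buf {
        uint8_t *b_data;        /* in-memory copy owned by "ZFS" */
        void    *b_zia_handle;  /* opaque handle to the offloaded copy */
        size_t   b_size;
    } buf_t;

    /* Hypothetical provider call: 0 on success, nonzero on failure. */
    static int
    offload_checksum(void *handle, size_t size, uint64_t *result)
    {
        (void) handle; (void) size; (void) result;
        return (-1);    /* simulate an accelerator-down style failure */
    }

    /* Hypothetical onload: copy offloaded data back before falling back. */
    static void
    onload(buf_t *b)
    {
        (void) b;       /* real code would copy back into b_data */
    }

    /* The existing software path stays in place as the fallback. */
    static uint64_t
    software_checksum(const uint8_t *data, size_t size)
    {
        uint64_t sum = 0;
        for (size_t i = 0; i < size; i++)
            sum += data[i];
        return (sum);
    }

    static uint64_t
    checksum_stage(buf_t *b)
    {
        uint64_t cksum = 0;

        /* Try the offloaded copy first, if one exists. */
        if (b->b_zia_handle != NULL &&
            offload_checksum(b->b_zia_handle, b->b_size, &cksum) == 0)
            return (cksum);

        /* On failure, onload the data and fall back to the ZFS path. */
        if (b->b_zia_handle != NULL)
            onload(b);
        return (software_checksum(b->b_data, b->b_size));
    }

    int
    main(void)
    {
        uint8_t data[4] = { 1, 2, 3, 4 };
        buf_t b = { .b_data = data, .b_zia_handle = (void *)data,
            .b_size = sizeof (data) };
        printf("checksum (software fallback) = %llu\n",
            (unsigned long long)checksum_stage(&b));
        return (0);
    }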

Jason Lee <jasonlee@lanl.gov>

ZFS Interface for Accelerators (Z.I.A.)

The ZIO pipeline has been modified to allow for external, alternative
implementations of existing operations to be used. The original ZFS
functions remain in the code as fallback in case the external
implementation fails.

Definitions:
  Accelerator - an entity (usually hardware) that is intended to
      accelerate operations
  Offloader - synonym of accelerator; used interchangeably
  Data Processing Unit Services Module (DPUSM)
      - https://github.com/hpc/dpusm
      - defines a "provider API" for accelerator vendors to set up
      - defines a "user API" for accelerator consumers to call
      - maintains a list of providers and coordinates interactions
        between providers and consumers
  Provider - a DPUSM wrapper for an accelerator's API
  Offload - moving data from ZFS/memory to the accelerator
  Onload - the opposite of offload

In order for Z.I.A. to be extensible, it does not directly communicate
with a fixed accelerator. Rather, Z.I.A. acquires a handle to a DPUSM,
which is then used to acquire handles to providers.

Using ZFS with Z.I.A.:
  1. Build and start the DPUSM
  2. Implement, build, and register a provider with the DPUSM
  3. Reconfigure ZFS with '--with-zia=<DPUSM root>'
  4. Rebuild and start ZFS
  5. Create a zpool
  6. Select the provider
         zpool set zia_provider=<provider name> <zpool>
  7. Select operations to offload
         zpool set zia_<property>=on <zpool>

The operations that have been modified are:
  - compression
      - non-raw-writes only
  - decompression
  - checksum
      - not handling embedded checksums
      - checksum compute and checksum error call the same function
  - raidz
      - generation
      - reconstruction
  - vdev_file
      - open
      - write
      - close
  - vdev_disk
      - open
      - invalidate
      - write
      - flush
      - close

Successful operations do not bring data back into memory after they
complete, allowing subsequent offloader operations to reuse the data. This
results in only one data movement per ZIO, at the beginning of a pipeline,
to get data from ZFS to the accelerator.

When errors occur and the offloaded data is still accessible, the
offloaded data will be onloaded (or dropped if it still matches the
in-memory copy) for that ZIO pipeline stage and processed with ZFS. This
will cause thrashing if a later operation offloads data. This should not
happen often, as constant errors (resulting in data movement) are not
expected to be the norm. Unrecoverable errors such as hardware failures
will trigger pipeline restarts (if necessary) in order to complete the
original ZIO using the software path.

The modifications to ZFS can be thought of as changes to two pipelines:
  - The ZIO write pipeline
      - compression, checksum, RAIDZ generation, and write
      - Each stage starts by offloading data that was not previously
        offloaded
          - This allows for ZIOs to be offloaded at any point in the
            pipeline
  - Resilver
      - vdev_raidz_io_done (RAIDZ reconstruction, checksum, and RAIDZ
        generation), and write
      - Because the core of resilver is vdev_raidz_io_done, data is only
        offloaded once at the beginning of vdev_raidz_io_done
          - Errors cause data to be onloaded, but will not re-offload in
            subsequent steps within resilver
          - Write is a separate ZIO pipeline stage, so it will attempt to
            offload data

The zio_decompress function has been modified to allow for offloading, but
the ZIO read pipeline as a whole has not, so it is not part of the above
list.

An example provider implementation can be found in
module/zia-software-provider:
  - The provider's "hardware" is actually software - data is "offloaded"
    to memory not owned by ZFS
  - Calls ZFS functions in order to not reimplement operations
  - Has kernel module parameters that can be used to trigger
    ZIA_ACCELERATOR_DOWN states for testing pipeline restarts

abd_t, raidz_row_t, and vdev_t have each been given an additional
"void *<prefix>_zia_handle" member. These opaque handles point to data
that is located on an offloader. abds are still allocated, but their
contents are expected to diverge from the offloaded copy as operations are
run.

ARC compression is disabled when Z.I.A. is configured.

Encryption and deduplication are disabled for zpools with Z.I.A.
operations enabled.

Aggregation is disabled for offloaded abds.

RPMs will build with Z.I.A.

Signed-off-by: Jason Lee <jasonlee@lanl.gov>
Pull-request: #13628 part 1/1