Success

Changes

Summary

  1. linux/io: move block sectorsize related lines together
  2. linux/io: remove support for /sys/class/dax
  3. linux/io: cleanup dax/non-dax devtype management
  4. linux/io: no need for sysfs local_cpus for OSdev locality
  5. linux/io: rework/fix numa_node attribute in sysfs
  6. gather-topology: gather dax driver info
  7. tests/linux: add dax driver information to gathered files
  8. linux: fix and factorize the checking of whether a DAX device is exposed as NUMA node
  9. linux: add DAXParent and DAXType info attr
  10. linux/dax: add some comments
  11. linux/block: replace "NVDIMM" subtype with "NVM" or "SPM" to match DAX attributes
  12. memattrs: heuristics to set NUMA node subtype to DRAM/HBM/SPM/NVM
  13. tests: add memtiers for testing subtypes of heterogeneous memory nodes
  14. tests/linux: add a complex test case with lots of heterogeneous memories
Commit e7aea83b5294578f304d1f046eb4d644960f26d5 by brice.goglin
linux/io: move block sectorsize related lines together

Signed-off-by: Brice Goglin <Brice.Goglin@inria.fr>
The file was modified: hwloc/topology-linux.c
Commit c413eb14fe9744ecfa84d58c285dba909da617f2 by brice.goglin
linux/io: remove support for /sys/class/dax

DAX devices moved to /sys/bus/dax soon after support for them was added
to Linux; distributions disabled /sys/class/dax, and it is now removed from
the latest kernels.
Also, /sys/class/dax lacks many features, such as the kmem driver for exposing
DAX devices as NUMA nodes and the hmem driver for specific-purpose memory.
So don't bother supporting it.

Signed-off-by: Brice Goglin <Brice.Goglin@inria.fr>
The file was modified: hwloc/topology-linux.c
Commit b4b352d0ca08ee6ba9ff83c8ab85f4f3ba3ea2ef by brice.goglin
linux/io: cleanup dax/non-dax devtype management

Clarify the difference between "class" and "bus" devices,
which explains why we follow the "device" symlink for the
former and not the latter.

And then clarify why we look at the parent for DAX bus devices.

Signed-off-by: Brice Goglin <Brice.Goglin@inria.fr>
The file was modified: hwloc/topology-linux.c
Commit 48673bc74b8435631325448224508cbf6d5b51ae by brice.goglin
linux/io: no need for sysfs local_cpus for OSdev locality

Most OSdev have a numa_node attribute in their sysfs hierarchy,
but local_cpus is only available in PCI sysfs devices.

OSdev that are related to a PCIdev are attached below that
PCIdev. We don't need to look at numa_node or local_cpus
here. The locality of the PCIdev was obtained earlier by
looking at local_cpus in hwloc_linux_backend_get_pci_busid_cpuset().
numa_node is also available (since 2007) but not used.

OSdev that are NOT related to a PCIdev have no relevant
local_cpus to look at, hence this code was never used.

Signed-off-by: Brice Goglin <Brice.Goglin@inria.fr>
The file was modified: hwloc/topology-linux.c
Commit 81b0bf3487698d6fbaa4f90f344712dc1a993e1b by brice.goglin
linux/io: rework/fix numa_node attribute in sysfs

For class devices (from /sys/class/foo), the actual device
is pointed to by the "device" symlink, and there's usually a numa_node
attribute in there.

For bus devices (from /sys/bus/foo/devices), numa_node can be
in the device itself (hmem DAX devices in some SUSE 5.3 kernels),
or in its parent (old DAX devices in pre-5.5 kernels),
or in both (newer kernels).

The previous code only supported the parent case. We now
use the device itself first for bus devices (enough in the vast
majority of cases), and then the parent if a flag was given
(for DAX on some old kernels).

Signed-off-by: Brice Goglin <Brice.Goglin@inria.fr>
The file was modified: hwloc/topology-linux.c
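
The numa_node lookup order described in the commit message above can be sketched roughly as follows. This is an illustrative Python sketch, not hwloc's C code: the names `read_int`, `find_numa_node`, `kind` and `check_parent` are hypothetical, and `check_parent` merely stands in for the flag mentioned in the message.

```python
import os

def read_int(path):
    """Return the integer stored in a sysfs-style file, or None if unreadable."""
    try:
        with open(path) as f:
            return int(f.read().strip())
    except (OSError, ValueError):
        return None

def find_numa_node(devpath, kind, check_parent=False):
    """Look up numa_node for a device directory (hypothetical helper).

    kind is "class" (entry from /sys/class/foo) or "bus" (entry from
    /sys/bus/foo/devices). check_parent stands in for the flag that the
    commit message says is given for DAX on some old kernels.
    """
    if kind == "class":
        # The actual device is behind the "device" symlink.
        return read_int(os.path.join(devpath, "device", "numa_node"))
    # Bus device: try the device itself first (enough in most cases).
    node = read_int(os.path.join(devpath, "numa_node"))
    if node is None and check_parent:
        # Fall back to the parent directory (old DAX devices).
        node = read_int(os.path.join(devpath, "..", "numa_node"))
    return node
```

With a DAX bus device, a call might look like `find_numa_node("/sys/bus/dax/devices/dax0.0", "bus", check_parent=True)`; the device name here is only an example.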
Commit 0b271232477b688e1eac3c0d50dd5db18a1f3650 by brice.goglin
gather-topology: gather dax driver info

We'll use them to identify NUMA nodes vs actual dax files.

Signed-off-by: Brice Goglin <Brice.Goglin@inria.fr>
The file was modified: utils/hwloc/hwloc-gather-topology.in
Commit 4bb528829f9a122217ae9f032de78483bb11da0f by brice.goglin
tests/linux: add dax driver information to gathered files

We'll need them soon.

Signed-off-by: Brice Goglin <Brice.Goglin@inria.fr>
The file was modified: tests/hwloc/linux/fakememinitiators-1np2c+1npp+gi.tar.bz2
The file was modified: tests/hwloc/linux/32em64t-2n8c+1mic.tar.bz2
Commit 161dbb2c18c5fbf5faceffeb1b0e11134cfdac15 by brice.goglin
linux: fix and factorize the checking of whether a DAX device is exposed as NUMA node

Even if target_node points to an existing NUMA node,
it doesn't mean that this specific DAX device was added to that node
(for instance, it could be a subpart of it when memmap/efi_fake_mem
boot parameters were used to change memory region attributes).

Signed-off-by: Brice Goglin <Brice.Goglin@inria.fr>
The file was modified: hwloc/topology-linux.c
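
The commit message above does not say how the check is done. Since the earlier messages mention the kmem driver (which exposes DAX devices as NUMA nodes) and the gathering of DAX driver info, one plausible approach is to test whether the device is bound to kmem. The sketch below assumes the usual sysfs layout where a bound device has a "driver" symlink; the function name `dax_is_numa_node` is hypothetical, and hwloc's actual test may differ.

```python
import os

def dax_is_numa_node(devpath):
    """Return True if the DAX device directory is bound to the kmem driver.

    Assumption: a bound device has a "driver" symlink whose target ends
    with the driver name, as is usual in sysfs.
    """
    try:
        target = os.readlink(os.path.join(devpath, "driver"))
    except OSError:
        return False  # no driver bound, or "driver" is not a symlink
    return os.path.basename(target) == "kmem"
```

This relies on the driver binding rather than on target_node, which matches the point of the message: target_node naming an existing node is not proof that this device's memory was added to it.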
Commit ed52753dbec61ac3c0257b5c89d24e2caa4c9192 by brice.goglin
linux: add DAXParent and DAXType info attr

Added to both DAX OS devices and their corresponding NUMA nodes.

DAXParent is a string describing the sysfs hierarchy going to the parent device
(contains "hmem" for soft-reserved specific-purpose memory
and "ndbus" for NVDIMMs).

DAXType is either "SPM" or "NVM" for now.
We'll use bandwidth later to detect when SPM is actually HBM.

The "ndbus" subsystem driving these DAX devices is for nvdimms only
right now, but I wouldn't be surprised if non-nvdimm hardware ended
up there in the future too, hence the name "NVM" instead of "NVDIMM".

Signed-off-by: Brice Goglin <Brice.Goglin@inria.fr>
The file was modified: hwloc/topology-linux.c
The file was modified: tests/hwloc/linux/32em64t-2n8c+1mic.output
The file was modified: tests/hwloc/linux/fakememinitiators-1np2c+1npp+gi.output
The file was modified: doc/hwloc.doxy
The file was modified: NEWS
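
The mapping from DAXParent to DAXType described above can be illustrated with a small sketch. The function name `dax_type_from_parent` is hypothetical, the sample strings in the usage note are made up for illustration, and returning None for unrecognized parents is an assumption (the message does not say what happens in that case).

```python
def dax_type_from_parent(dax_parent):
    """Guess DAXType from a DAXParent string (illustrative only).

    "hmem" marks soft-reserved specific-purpose memory, "ndbus" marks
    NVDIMMs, as stated in the commit message.
    """
    if "hmem" in dax_parent:
        return "SPM"
    if "ndbus" in dax_parent:
        return "NVM"
    return None  # assumption: unknown parents get no DAXType
```

For example, `dax_type_from_parent("platform/hmem.0")` gives "SPM" and `dax_type_from_parent("some/path/ndbus0/region0/dax0.0")` gives "NVM".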
Commit 1bdb4e37a900df7b67d7144fae71b0ae698f849b by brice.goglin
linux/dax: add some comments

Signed-off-by: Brice Goglin <Brice.Goglin@inria.fr>
The file was modified: hwloc/topology-linux.c
Commit 2bb8dad3346a3a45e59885f1c341401c975c6c28 by brice.goglin
linux/block: replace "NVDIMM" subtype with "NVM" or "SPM" to match DAX attributes

A DAX file may come from an HBM node marked as specific-purpose (SPM).

Signed-off-by: Brice Goglin <Brice.Goglin@inria.fr>
The file was modified: doc/hwloc.doxy
The file was modified: hwloc/topology-linux.c
The file was modified: tests/hwloc/linux/32em64t-2n8c+1mic.output
Commit a82267bcf4d0c220529eff886760b432c7392cf8 by brice.goglin
memattrs: heuristics to set NUMA node subtype to DRAM/HBM/SPM/NVM

Internally, we classify NUMA nodes by tier:
1) "UNKNOWN" (usually DRAM)
2) "SPM" (UEFI name "Specific-Purpose Memory") is what Linux exposes
   as "Soft-Reserved" DAX, usually HBM but could be something else.
3) "NVM" (NVDIMMs already detected on Linux through dax/kmem)
4) "GPU" (NVIDIA-only, already detected on Linux, and exposed with subtype "GPUMemory")

If (2) has 2x higher bandwidth than (1), (2) becomes HBM and (1) becomes DRAM.
If HWLOC_MEMTIERS_GUESS=spm_is_hbm is set in the environment, we don't even
look at the bandwidth.

In the end, we set NUMA node subtypes to DRAM/HBM/NVM.
We keep SPM if we couldn't guess that SPM was HBM.
DRAM isn't set unless there's something else in the system.

The heuristics are applied at the end of topology loading, even when
loading from XML.

Signed-off-by: Brice Goglin <Brice.Goglin@inria.fr>
The file was modified: tests/hwloc/linux/fakememinitiators-1np2c+1npp+gi.output
The file was modified: NEWS
The file was modified: include/hwloc/rename.h
The file was modified: hwloc/topology.c
The file was modified: doc/hwloc.doxy
The file was modified: include/private/private.h
The file was modified: hwloc/memattrs.c
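
The tier rules in the commit message above can be sketched as a small function. This is not the code in hwloc/memattrs.c: the function name `guess_subtypes` and its arguments are hypothetical, "2x higher" is read as "at least twice" (an assumption), and the sketch only turns the unknown tier into DRAM when SPM is recognized as HBM, which is the one case the message spells out.

```python
def guess_subtypes(bandwidth, guess=""):
    """Map memory tiers to NUMA node subtypes (illustrative sketch).

    bandwidth: dict {tier name: bandwidth or None} for the tiers present,
               with tier names "UNKNOWN", "SPM", "NVM", "GPU".
    guess:     stand-in for the value of HWLOC_MEMTIERS_GUESS.
    Returns a dict {tier name: subtype}; tiers without an entry keep no subtype.
    """
    subtypes = {}
    if "NVM" in bandwidth:
        subtypes["NVM"] = "NVM"
    if "GPU" in bandwidth:
        subtypes["GPU"] = "GPUMemory"
    if "SPM" in bandwidth:
        spm_bw = bandwidth["SPM"]
        unknown_bw = bandwidth.get("UNKNOWN")
        if guess == "spm_is_hbm":
            is_hbm = True  # do not even look at the bandwidth
        else:
            # Assumption: "2x higher" means at least twice the bandwidth.
            is_hbm = (spm_bw is not None and unknown_bw is not None
                      and spm_bw >= 2 * unknown_bw)
        if is_hbm:
            subtypes["SPM"] = "HBM"
            if "UNKNOWN" in bandwidth:
                subtypes["UNKNOWN"] = "DRAM"
        else:
            subtypes["SPM"] = "SPM"  # could not guess that SPM was HBM
    return subtypes
```

A caller could pass `os.environ.get("HWLOC_MEMTIERS_GUESS", "")` as `guess`. With `{"UNKNOWN": 1000, "SPM": None}` the SPM tier stays "SPM" because there is no bandwidth to compare, unless `guess="spm_is_hbm"` is given.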
Commit 6cd6c6fad7aaf6006bfbe2fd6f0af9c33c52e1ad by brice.goglin
tests: add memtiers for testing subtypes of heterogeneous memory nodes

Signed-off-by: Brice Goglin <Brice.Goglin@inria.fr>
The file was modified: tests/hwloc/Makefile.am
The file was added: tests/hwloc/memtiers.c
Commit 526cdf1b7fc87f1f4a3b1e14d675139d65342681 by brice.goglin
tests/linux: add a complex test case with lots of heterogeneous memories

Package0 has DRAM+HBM+NVM but HBM is exposed as DAX (SPM instead of HBM since DAX devices have no bandwidth info).
Package1 has DRAM+2xHBM+NVM but one HBM is exposed as DAX (SPM instead of HBM).
Package2 has HBM+2xNVM but one NVM is exposed as DAX (NVM).

The case was generated with:
qemu-system-x86_64 -accel kvm \
-machine pc,nvdimm=on,hmat=on \
-drive if=pflash,format=raw,file=$FILES/OVMF.fd \
-drive media=disk,format=qcow2,file=$FILES/efi.qcow2 \
-smp 6 \
-m 6G,slots=4,maxmem=8G \
-object memory-backend-ram,size=3G,id=ram0 \
-object memory-backend-ram,size=1G,id=ram1 \
-object memory-backend-ram,size=512M,id=ram2 \
-object memory-backend-ram,size=512M,id=ram3 \
-object memory-backend-ram,size=512M,id=ram4 \
-object memory-backend-ram,size=512M,id=ram5 \
-numa node,nodeid=0,memdev=ram0,cpus=0-1 \
-numa node,nodeid=1,memdev=ram1,cpus=2-3 \
-numa node,nodeid=2,memdev=ram2,initiator=0 \
-numa node,nodeid=3,memdev=ram3,initiator=1 \
-numa node,nodeid=4,memdev=ram4,initiator=1 \
-numa node,nodeid=5,memdev=ram5,cpus=4-5 \
-numa node,nodeid=6,initiator=0 \
-numa node,nodeid=7,initiator=1 \
-numa node,nodeid=8,initiator=5 \
-numa node,nodeid=9,initiator=5 \
-object memory-backend-file,id=nvdimm1,share=on,mem-path=/tmp/nvdimm1.img,size=512M \
-device nvdimm,id=nvdimm1,memdev=nvdimm1,unarmed=off,node=6 \
-object memory-backend-file,id=nvdimm2,share=on,mem-path=/tmp/nvdimm2.img,size=512M \
-device nvdimm,id=nvdimm2,memdev=nvdimm2,unarmed=off,node=7 \
-object memory-backend-file,id=nvdimm3,share=on,mem-path=/tmp/nvdimm3.img,size=512M \
-device nvdimm,id=nvdimm3,memdev=nvdimm3,unarmed=off,node=8 \
-object memory-backend-file,id=nvdimm4,share=on,mem-path=/tmp/nvdimm4.img,size=512M \
-device nvdimm,id=nvdimm4,memdev=nvdimm4,unarmed=off,node=9
Booted a 5.18 kernel with efi_fake_mem=2G@5G:0x40000 to mark PXM 2-5 as SPM.

Then all NVDIMM namespaces (5.0 to 8.0) are converted to devdax with
ndctl disable-namespace namespaceX.0
ndctl create-namespace -f -e namespaceX.0 -t pmem --mode=devdax
Then some DAX (1.0 3.0 5.0 7.0 and 8.0) are exposed as NUMA nodes with
daxctl reconfigure-device --mode=system-ram daxX.0

Then run hwloc-gather-topology and tweak things:
* Remove dax4.0 since it doesn't really exist (it appears because
  we used a single 2G efi_fake_mem= instead of 4 consecutive ones,
  but it actually covers dax0.0-dax3.0).
* Fix the locality of some DAX (in /sys/bus/dax/devices/*/{,..}/numa_node)
  that Qemu doesn't set correctly because we didn't give the HMAT and SLIT
  tables (and initiator= doesn't work very well).
* Then set R/W bandwidth of all nodes (100 for NVM, 1000 for DRAM, 10000 for HBM)
  in /sys/devices/system/node/node?/access1/initiators/*bandwidth
  (possible with Qemu but requires loooooots of useless values on the command-line
   to fill the matrix).

Signed-off-by: Brice Goglin <Brice.Goglin@inria.fr>
The file was modified: tests/hwloc/linux/Makefile.am
The file was added: tests/hwloc/linux/fakeheteromemtiers.tar.bz2
The file was added: tests/hwloc/linux/fakeheteromemtiers.output
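
The bandwidth tweak described in the last commit message (writing values into the `*bandwidth` files of each node's `access1/initiators` directory) could be scripted along these lines. This is a hedged sketch: the function name `set_node_bandwidth` is hypothetical, and it assumes the gathered files were extracted into a directory that contains a `sys/` subtree.

```python
import glob
import os

def set_node_bandwidth(tree_root, node, value):
    """Overwrite all *bandwidth files of one node in an extracted tree.

    tree_root: directory holding the extracted "sys/" subtree (assumption).
    node:      NUMA node number.
    value:     bandwidth to write, e.g. 100 for NVM, 1000 for DRAM,
               10000 for HBM as in the commit message.
    Returns the sorted list of files that were rewritten.
    """
    pattern = os.path.join(tree_root, "sys", "devices", "system", "node",
                           "node%d" % node, "access1", "initiators",
                           "*bandwidth")
    paths = sorted(glob.glob(pattern))
    for path in paths:
        with open(path, "w") as f:
            f.write("%d\n" % value)
    return paths
```

Which node gets which value depends on the test case layout and is left to the caller, for instance `set_node_bandwidth("fakeheteromemtiers", 2, 10000)`, where both the directory name and the node number are only examples.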