|
| hwloc/topology-linux.c (diff) |
Commit
c413eb14fe9744ecfa84d58c285dba909da617f2
by brice.goglin
linux/io: remove support for /sys/class/dax
DAX devices moved to /sys/bus/dax soon after support for them was added in Linux; distributions disabled it, and it is now removed from the latest kernels. Also, /sys/class/dax lacks many features, such as the kmem driver for exposing DAX devices as NUMA nodes and the hmem driver for specific-purpose memory. So don't bother supporting it.
Signed-off-by: Brice Goglin <Brice.Goglin@inria.fr>
|
| hwloc/topology-linux.c (diff) |
Commit
b4b352d0ca08ee6ba9ff83c8ab85f4f3ba3ea2ef
by brice.goglin
linux/io: cleanup dax/non-dax devtype management
Clarify the difference between "class" and "bus" devices, which explains why we follow the "device" symlink for the former and not the latter.
And then clarify why we look at the parent for DAX bus devices.
Signed-off-by: Brice Goglin <Brice.Goglin@inria.fr>
|
| hwloc/topology-linux.c (diff) |
Commit
48673bc74b8435631325448224508cbf6d5b51ae
by brice.goglin
linux/io: no need for sysfs local_cpus for OSdev locality
Most OSdev have a numa_node attribute in their sysfs hierarchy, but local_cpus is only available for PCI sysfs devices.
OSdev that are related to a PCIdev are attached below that PCIdev. We don't need to look at numa_node or local_cpus here. The locality of the PCIdev was obtained earlier by looking at local_cpus in hwloc_linux_backend_get_pci_busid_cpuset(). numa_node is also available (since 2007) but not used.
OSdev that are NOT related to a PCIdev have no relevant local_cpus to look at, hence this code was never used.
Signed-off-by: Brice Goglin <Brice.Goglin@inria.fr>
|
| hwloc/topology-linux.c (diff) |
Commit
81b0bf3487698d6fbaa4f90f344712dc1a993e1b
by brice.goglin
linux/io: rework/fix numa_node attribute in sysfs
For class devices (from /sys/class/foo), the actual device is pointed to by the "device" symlink, and there's usually a numa_node attribute in there.
For bus devices (from /sys/bus/foo/devices), numa_node can be in the device itself (hmem DAX devices in some Suse 5.3 kernel), or in its parent (old DAX devices in pre-5.5 kernels), or in both (newer kernels).
The previous code only supported the parent case. We now use the device itself first for bus devices (enough in the vast majority of cases), and then the parent if a flag was given (for DAX on some old kernels).
Signed-off-by: Brice Goglin <Brice.Goglin@inria.fr>
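The lookup order described in this commit message can be sketched as follows. This is only an illustration, not the actual hwloc code: the function names, the flag names and the buffer sizes are all hypothetical, and only the sysfs layout (the "device" symlink, the numa_node attribute, the parent directory) comes from the message above.

```c
#include <stdio.h>

/* Hypothetical flags, not the names used in hwloc. */
#define SKETCH_FLAG_CLASS_DEVICE   (1U << 0) /* path comes from /sys/class/foo */
#define SKETCH_FLAG_BUS_TRY_PARENT (1U << 1) /* bus device: also try the parent */

/* Hypothetical helper: read an integer from "<dir>/<name>".
 * Returns 0 on success, -1 if the file is missing or unparsable. */
int sketch_read_int_attr(const char *dir, const char *name, int *value)
{
  char path[1024];
  FILE *f;
  int ok;

  snprintf(path, sizeof(path), "%s/%s", dir, name);
  f = fopen(path, "r");
  if (!f)
    return -1;
  ok = (fscanf(f, "%d", value) == 1);
  fclose(f);
  return ok ? 0 : -1;
}

/* Find numa_node for a sysfs device directory.
 * Class devices: follow the "device" symlink.
 * Bus devices: look in the device itself first, then in its parent
 * only if the caller asked for it (DAX on some old kernels). */
int sketch_get_numa_node(const char *devpath, unsigned flags, int *node)
{
  char path[512];

  if (flags & SKETCH_FLAG_CLASS_DEVICE) {
    snprintf(path, sizeof(path), "%s/device", devpath);
    return sketch_read_int_attr(path, "numa_node", node);
  }

  if (sketch_read_int_attr(devpath, "numa_node", node) == 0)
    return 0;

  if (flags & SKETCH_FLAG_BUS_TRY_PARENT) {
    snprintf(path, sizeof(path), "%s/..", devpath);
    return sketch_read_int_attr(path, "numa_node", node);
  }
  return -1;
}
```

A caller handling a DAX bus device would pass something like "/sys/bus/dax/devices/dax0.0" together with SKETCH_FLAG_BUS_TRY_PARENT; the device name here is only an example.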
|
| hwloc/topology-linux.c (diff) |
|
| utils/hwloc/hwloc-gather-topology.in (diff) |
|
| tests/hwloc/linux/fakememinitiators-1np2c+1npp+gi.tar.bz2 (diff) |
| tests/hwloc/linux/32em64t-2n8c+1mic.tar.bz2 (diff) |
Commit
161dbb2c18c5fbf5faceffeb1b0e11134cfdac15
by brice.goglin
linux: fix and factorize the checking of whether a DAX device is exposed as NUMA node
Even if target_node points to an existing NUMA node, it doesn't mean that this specific DAX device was added to that node (for instance, it could be a subpart of it when memmap/efi_fake_mem boot parameters were used to change memory region attributes).
Signed-off-by: Brice Goglin <Brice.Goglin@inria.fr>
|
| hwloc/topology-linux.c (diff) |
Commit
ed52753dbec61ac3c0257b5c89d24e2caa4c9192
by brice.goglin
linux: add DAXParent and DAXType info attr
Added to both DAX OS devices and their corresponding NUMA nodes.
DAXParent is a string describing the sysfs hierarchy going to the parent device (contains "hmem" for soft-reserved specific-purpose memory and "ndbus" for NVDIMMs).
DAXType is either "SPM" or "NVM" for now. We'll use bandwidth later to detect when SPM is actually HBM.
The "ndbus" subsystem driving these DAX devices is for nvdimms only right now, but I wouldn't be surprised if non-nvdimm hardware ended up there in the future too, hence the name "NVM" instead of "NVDIMM".
Signed-off-by: Brice Goglin <Brice.Goglin@inria.fr>
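A minimal sketch of how DAXType could be derived from DAXParent, assuming a plain substring match on the two markers named above ("hmem" and "ndbus"). The function name, the order of the checks and the example parent strings are made up for illustration and are not taken from hwloc.

```c
#include <stddef.h>
#include <string.h>

/* Map a DAXParent string to a DAXType value.
 * Returns "SPM" or "NVM", or NULL when neither marker is present.
 * Hypothetical helper; the substring matching is an assumption. */
const char *sketch_dax_type_from_parent(const char *daxparent)
{
  if (!daxparent)
    return NULL;
  if (strstr(daxparent, "hmem"))
    return "SPM"; /* soft-reserved specific-purpose memory */
  if (strstr(daxparent, "ndbus"))
    return "NVM"; /* NVDIMMs, for now */
  return NULL;
}
```

For example, a parent string containing "ndbus" would yield "NVM", and an unrelated string would yield NULL so that the caller can leave DAXType unset.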
|
| hwloc/topology-linux.c (diff) |
| tests/hwloc/linux/32em64t-2n8c+1mic.output (diff) |
| tests/hwloc/linux/fakememinitiators-1np2c+1npp+gi.output (diff) |
| doc/hwloc.doxy (diff) |
| NEWS (diff) |
|
| hwloc/topology-linux.c (diff) |
Commit
2bb8dad3346a3a45e59885f1c341401c975c6c28
by brice.goglin
linux/block: replace "NVDIMM" subtype with "NVM" or "SPM" to match DAX attributes
A DAX file may come from an HBM node marked as specific-purpose (SPM).
Signed-off-by: Brice Goglin <Brice.Goglin@inria.fr>
|
| doc/hwloc.doxy (diff) |
| hwloc/topology-linux.c (diff) |
| tests/hwloc/linux/32em64t-2n8c+1mic.output (diff) |
Commit
a82267bcf4d0c220529eff886760b432c7392cf8
by brice.goglin
memattrs: heuristics to set NUMA node subtype to DRAM/HBM/SPM/NVM
Internally, we classify NUMA nodes by tier:
1) "UNKNOWN" (usually DRAM)
2) "SPM" (UEFI name "Specific-Purpose Memory") is what Linux exposes as "Soft-Reserved" DAX, usually HBM but could be something else.
3) "NVM" (NVDIMMs already detected on Linux through dax/kmem)
4) "GPU" (NVIDIA-only, already detected on Linux, and exposed with subtype "GPUMemory")
If (2) has 2x higher bandwidth than (1), (2) becomes HBM and (1) becomes DRAM. If HWLOC_MEMTIERS_GUESS=spm_is_hbm is set in the environment, we don't even look at the bandwidth.
In the end, we set DRAM/HBM/NVM as NUMA node subtypes. We keep SPM if we couldn't guess that SPM was HBM. DRAM isn't set unless there's something else in the system.
The heuristic is applied at the end of topology discovery, even when loading from XML.
Signed-off-by: Brice Goglin <Brice.Goglin@inria.fr>
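The SPM-versus-HBM decision described above can be sketched like this. It is an illustration under assumptions, not the code in hwloc/memattrs.c: the function names are hypothetical, the environment value is passed in as a parameter to keep the sketch easy to test, and the exact comparison (here "at least twice the DRAM bandwidth") is a guess at what "2x higher" means.

```c
#include <stddef.h>
#include <string.h>

/* Decide the subtype of an SPM node.
 * guess_env is the value of HWLOC_MEMTIERS_GUESS (may be NULL).
 * dram_bw and spm_bw are bandwidths in the same arbitrary unit. */
const char *sketch_spm_subtype(const char *guess_env,
                               unsigned long dram_bw,
                               unsigned long spm_bw)
{
  /* The environment override skips the bandwidth check entirely. */
  if (guess_env && strcmp(guess_env, "spm_is_hbm") == 0)
    return "HBM";
  /* Assumption: "2x higher" is read as spm_bw >= 2 * dram_bw. */
  if (dram_bw > 0 && spm_bw >= 2 * dram_bw)
    return "HBM";
  return "SPM"; /* could not guess, keep SPM */
}

/* DRAM is only set when some other kind of memory exists. */
const char *sketch_dram_subtype(int other_tiers_present)
{
  return other_tiers_present ? "DRAM" : NULL;
}
```

A caller would typically pass getenv("HWLOC_MEMTIERS_GUESS") as the first argument. With the bandwidth values used by the test case in the next commit (1000 for DRAM, 10000 for HBM), an SPM node with bandwidth 10000 would be reported as HBM.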
|
| tests/hwloc/linux/fakememinitiators-1np2c+1npp+gi.output (diff) |
| NEWS (diff) |
| include/hwloc/rename.h (diff) |
| hwloc/topology.c (diff) |
| doc/hwloc.doxy (diff) |
| include/private/private.h (diff) |
| hwloc/memattrs.c (diff) |
|
| tests/hwloc/Makefile.am (diff) |
| tests/hwloc/memtiers.c |
Commit
526cdf1b7fc87f1f4a3b1e14d675139d65342681
by brice.goglin
tests/linux: add a complex test case with lots of heterogeneous memories
Package0 has DRAM+HBM+NVM but HBM is exposed as DAX (SPM instead of HBM since DAX devices have no BW info).
Package1 has DRAM+2xHBM+NVM but one HBM is exposed as DAX (SPM instead of HBM).
Package2 has HBM+2xNVM but one NVM is exposed as DAX (NVM).
The case was generated with:
  qemu-system-x86_64 -accel kvm \
    -machine pc,nvdimm=on,hmat=on \
    -drive if=pflash,format=raw,file=$FILES/OVMF.fd \
    -drive media=disk,format=qcow2,file=$FILES/efi.qcow2 \
    -smp 6 \
    -m 6G,slots=4,maxmem=8G \
    -object memory-backend-ram,size=3G,id=ram0 \
    -object memory-backend-ram,size=1G,id=ram1 \
    -object memory-backend-ram,size=512M,id=ram2 \
    -object memory-backend-ram,size=512M,id=ram3 \
    -object memory-backend-ram,size=512M,id=ram4 \
    -object memory-backend-ram,size=512M,id=ram5 \
    -numa node,nodeid=0,memdev=ram0,cpus=0-1 \
    -numa node,nodeid=1,memdev=ram1,cpus=2-3 \
    -numa node,nodeid=2,memdev=ram2,initiator=0 \
    -numa node,nodeid=3,memdev=ram3,initiator=1 \
    -numa node,nodeid=4,memdev=ram4,initiator=1 \
    -numa node,nodeid=5,memdev=ram5,cpus=4-5 \
    -numa node,nodeid=6,initiator=0 \
    -numa node,nodeid=7,initiator=1 \
    -numa node,nodeid=8,initiator=5 \
    -numa node,nodeid=9,initiator=5 \
    -object memory-backend-file,id=nvdimm1,share=on,mem-path=/tmp/nvdimm1.img,size=512M \
    -device nvdimm,id=nvdimm1,memdev=nvdimm1,unarmed=off,node=6 \
    -object memory-backend-file,id=nvdimm2,share=on,mem-path=/tmp/nvdimm2.img,size=512M \
    -device nvdimm,id=nvdimm2,memdev=nvdimm2,unarmed=off,node=7 \
    -object memory-backend-file,id=nvdimm3,share=on,mem-path=/tmp/nvdimm3.img,size=512M \
    -device nvdimm,id=nvdimm3,memdev=nvdimm3,unarmed=off,node=8 \
    -object memory-backend-file,id=nvdimm4,share=on,mem-path=/tmp/nvdimm4.img,size=512M \
    -device nvdimm,id=nvdimm4,memdev=nvdimm4,unarmed=off,node=9
Booted with 5.18 with efi_fake_mem=2G@5G:0x40000 to mark PXM 2-5 as SPM.
Then all NVDIMM namespaces (5.0 to 8.0) are converted to devdax with
  ndctl disable-namespace namespaceX.0
  ndctl create-namespace -f -e namespaceX.0 -t pmem --mode=devdax
Then some DAX (1.0 3.0 5.0 7.0 and 8.0) are exposed as NUMA nodes with
  daxctl reconfigure-device --mode=system-ram daxX.0
Then run hwloc-gather-topology and tweak things:
* Remove dax4.0 since it doesn't really exist (it appears because we used a single 2G efi_fake_mem= instead of 4 consecutive ones, but it actually covers dax0.0-dax3.0).
* Fix the locality of some DAX (in /sys/bus/dax/devices/*/{,..}/numa_node) that Qemu doesn't set correctly because we didn't give the HMAT and SLIT tables (and initiator= doesn't work very well).
* Then set R/W bandwidth of all nodes (100 for NVM, 1000 for DRAM, 10000 for HBM) in /sys/devices/system/node/node?/access1/initiators/*bandwidth (possible with Qemu, but it requires lots of useless values on the command line to fill the matrix).
Signed-off-by: Brice Goglin <Brice.Goglin@inria.fr>
|
| tests/hwloc/linux/Makefile.am (diff) |
| tests/hwloc/linux/fakeheteromemtiers.tar.bz2 |
| tests/hwloc/linux/fakeheteromemtiers.output |