Discussion:
Programmatically cache line
blubee blubeeme
2017-12-30 07:50:19 UTC
Is there some way to programmatically get the CPU cache line sizes on
FreeBSD?
Konstantin Belousov
2017-12-30 08:28:12 UTC
Post by blubee blubeeme
Is there some way to programmatically get the CPU cache line sizes on
FreeBSD?
There are; all of them are MD (machine-dependent).

On x86, the CPUID instruction leaf 0x1 returns the information in the
%ebx register.
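
A minimal sketch of the CPUID approach described above, assuming a GCC- or clang-compatible compiler that ships <cpuid.h>. The helper name clflush_line_size() is hypothetical. Note that the field read here is the CLFLUSH line size (EBX bits 15:8, in 8-byte units), which is not necessarily the same number you want for avoiding false sharing.

```c
#if defined(__i386__) || defined(__x86_64__)
#include <cpuid.h>	/* __get_cpuid(), provided by GCC and clang */
#endif

/*
 * Hypothetical helper: return the CLFLUSH line size in bytes,
 * or -1 if it cannot be determined on this machine.
 */
static int
clflush_line_size(void)
{
#if defined(__i386__) || defined(__x86_64__)
	unsigned int eax, ebx, ecx, edx;

	/* __get_cpuid() returns 0 when the requested leaf is unsupported. */
	if (__get_cpuid(1, &eax, &ebx, &ecx, &edx) == 0)
		return (-1);
	/* EDX bit 19 (CLFSH) tells whether the CLFLUSH field is valid. */
	if ((edx & (1u << 19)) == 0)
		return (-1);
	/* EBX bits 15:8 hold the line size in units of 8 bytes. */
	return ((int)((ebx >> 8) & 0xff) * 8);
#else
	/* Not x86: this MD method does not apply. */
	return (-1);
#endif
}
```

A caller would treat -1 as "unknown" and fall back to a conservative constant.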
Konstantin Belousov
2018-01-01 10:36:55 UTC
Post by Konstantin Belousov
Post by blubee blubeeme
Is there some way to programmatically get the CPU cache line sizes on
FreeBSD?
There are, all of them are MD.
On x86, the CPUID instruction leaf 0x1 returns the information in
%ebx register.
Hm, weird. Why don't we extend sysctl to include this info?
For the same reason we do not provide a sysctl to add two integers.
It would be nice to expose this kind of information via VDSO or similar. There are a lot of similar bits of info that people want to use for ifunc, and SVE is going to have a bunch of similar requirements.
Is VDSO a new trendy word?

ifunc resolvers in usermode on FreeBSD/x86 get four arguments which
are essentially cpu_features / cpu_features2 / cpu_stdext_features /
cpu_stdext_features2. I suspect that only FreeBSD/x86 arches have the
ifunc support, in rtld and coming shortly in kernel.

Recently HW_CAP/HW_CAP2 were added to the ELF auxv, and elf_aux_info(3)
interface exported from libc.

ARM* has not yet implemented the ifunc stubs in rtld. I believe this is
considered low priority because there is no ready-to-use toolchain
that allows using ifuncs on FreeBSD, unless you use a recent bfd
ld externally.
Ian Lepore
2018-01-01 16:26:29 UTC
Post by Konstantin Belousov
Post by Konstantin Belousov
Post by blubee blubeeme
Is there some way to programmatically get the CPU cache line sizes on
FreeBSD?
There are, all of them are MD.
On x86, the CPUID instruction leaf 0x1 returns the information in
%ebx register.
Hm, weird. Why don't we extend sysctl to include this info?
For the same reason we do not provide a sysctl to add two integers.
It would be nice to expose this kind of information via VDSO or similar.  There are a lot of similar bits of info that people want to use for ifunc and, SVE is going to have a bunch of similar requirements.
Is VDSO a new trendy word ?
ifunc resolvers in usermode on FreeBSD/x86 get four arguments which
are essentially cpu_features / cpu_features2 / cpu_stdext_features /
cpu_stdext_features2.  I suspect that only FreeBSD/x86 arches have the
ifunc support, in rtld and coming shortly in kernel.
Recently HW_CAP/HW_CAP2 were added to the ELF auxv, and elf_aux_info(3)
interface exported from libc.
ARM* did not implemented yet the ifunc stubs in rtld. I believe this is
considered a low priority because there is no ready to use toolchain
which allow to utilize ifuncs on FreeBSD, except if you use recent bfd
ld externally.
Linux exports this info using getauxval(). I think we should support
getauxval() and as many of the AT_* values that linux defines as makes
sense for us to do.

I think it was a mistake to give our version of the function a
different name and different semantics, but this is something that
affects mainly ports, and I don't yet have enough info to make the case
that being linux-compatible will ease porting rather than complicate it
(in some cases, patches will be needed either way).
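
To illustrate the "different name and different semantics" point, here is a rough sketch of a wrapper that hides the difference between the two calls. The helper name read_hwcap() is hypothetical, and it assumes the "HW_CAP" mentioned earlier in the thread corresponds to the AT_HWCAP tag. On platforms that do not supply that tag the call simply reports an error.

```c
#include <errno.h>

#if defined(__FreeBSD__) || defined(__linux__)
#include <sys/auxv.h>	/* elf_aux_info() on FreeBSD, getauxval() on Linux */
#endif

/*
 * Hypothetical helper: store the AT_HWCAP value in *out.
 * Returns 0 on success or an errno-style code on failure.
 */
static int
read_hwcap(unsigned long *out)
{
#if defined(__FreeBSD__)
	/* FreeBSD: value is copied into a caller buffer, error code returned. */
	return (elf_aux_info(AT_HWCAP, out, (int)sizeof(*out)));
#elif defined(__linux__)
	/* Linux: value is returned directly; errno signals a missing tag. */
	errno = 0;
	*out = getauxval(AT_HWCAP);
	return ((*out == 0 && errno != 0) ? errno : 0);
#else
	/* Other systems: no aux vector interface assumed. */
	(void)out;
	return (ENOSYS);
#endif
}
```

Ported code that calls getauxval() directly would need either a shim like this or a patch, which is the trade-off discussed above.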

-- Ian
blubee blubeeme
2018-01-02 01:27:01 UTC
Post by blubee blubeeme
Post by Konstantin Belousov
Post by Konstantin Belousov
Post by blubee blubeeme
Is there some way to programmatically get the CPU cache line sizes on
FreeBSD?
There are, all of them are MD.
On x86, the CPUID instruction leaf 0x1 returns the information in
%ebx register.
Hm, weird. Why don't we extend sysctl to include this info?
For the same reason we do not provide a sysctl to add two integers.
It would be nice to expose this kind of information via VDSO or
similar. There are a lot of similar bits of info that people want to use
for ifunc, and SVE is going to have a bunch of similar requirements.
Is VDSO a new trendy word?
ifunc resolvers in usermode on FreeBSD/x86 get four arguments which
are essentially cpu_features / cpu_features2 / cpu_stdext_features /
cpu_stdext_features2. I suspect that only FreeBSD/x86 arches have the
ifunc support, in rtld and coming shortly in kernel.
Recently HW_CAP/HW_CAP2 were added to the ELF auxv, and elf_aux_info(3)
interface exported from libc.
ARM* did not implemented yet the ifunc stubs in rtld. I believe this is
considered a low priority because there is no ready to use toolchain
which allow to utilize ifuncs on FreeBSD, except if you use recent bfd
ld externally.
Linux exports this info using getauxval(). I think we should support
getauxval() and as many of the AT_* values that linux defines as makes
sense for us to do.
I think it was a mistake to give our version of the function a
different name and different semantics, but this is something that
affects mainly ports, and I don't yet have enough info to make the case
that being linux-compatible will ease porting rather than complicate it
(in some cases, patches will be needed either way).
-- Ian
FreeBSD implements hardware-specific atomic instructions [man atomic, or
look at #include <machine/atomic.h>], but implementing something that
returns the size of cache lines is somehow out of the question?

If you're working with atomic data structures and want to ensure there's no
false sharing, the simplest method I know is to add padding of
sizeof(cache_line) - sizeof(data_members) bytes, so that the objects
end up on different cache lines.
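
A small sketch of that padding technique in C11. The 64-byte constant is an assumption for illustration only (the whole thread is about how to obtain the real value), and the type names are hypothetical.

```c
#include <stdalign.h>
#include <stdatomic.h>
#include <stddef.h>

/* Assumption: 64-byte lines. Replace with the value for your target. */
#define CACHE_LINE_ASSUMED 64

/* One counter per cache line: aligned start plus explicit tail padding. */
struct padded_counter {
	alignas(CACHE_LINE_ASSUMED) atomic_ulong value;
	char pad[CACHE_LINE_ASSUMED - sizeof(atomic_ulong)];
};

/* Two counters updated by different threads, kept on separate lines. */
struct counters {
	struct padded_counter produced;
	struct padded_counter consumed;
};

/* Compile-time checks that the layout is what the padding intends. */
_Static_assert(sizeof(struct padded_counter) == CACHE_LINE_ASSUMED,
    "padded_counter must occupy exactly one assumed cache line");
_Static_assert(offsetof(struct counters, consumed) == CACHE_LINE_ASSUMED,
    "consumed must start on the next assumed cache line");
```

Because the size is baked in at compile time, a wrong guess costs either memory (too large) or false sharing (too small), which is why a reliable way to query it matters.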

Do we have to write inline assembly to grab the size of the cache
line, or wouldn't it be simpler to have atomic.h return this info?
Nathan Whitehorn
2018-01-03 22:12:00 UTC
Post by Konstantin Belousov
Post by Konstantin Belousov
On x86, the CPUID instruction leaf 0x1 returns the information in
%ebx register.
Hm, weird. Why don't we extend sysctl to include this info?
For the same reason we do not provide a sysctl to add two integers.
I strongly agree with Kostik on this one. Why add stuff to the kernel,
if userspace is already capable of extracting this? Adding that stuff
to sysctl has the downside that it will effectively introduce yet
another FreeBSDism, whereas something generic already exists.
Well, kind of. The userspace version is platform-dependent and not
always available: for example, on PPC, you can't do this from userland
and we provide a sysctl machdep.cacheline_size to userland. It would be
nice to have an MI API.
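
For reference, reading that sysctl from userland could look roughly like the sketch below. The helper name cacheline_size_hint() is hypothetical, and treating the value as an int is an assumption. On FreeBSD platforms that do not export machdep.cacheline_size, and on other systems, it returns -1.

```c
#include <stddef.h>

#if defined(__FreeBSD__)
#include <sys/types.h>
#include <sys/sysctl.h>	/* sysctlbyname() */
#endif

/*
 * Hypothetical helper: return the kernel-reported cache line size in
 * bytes, or -1 if the sysctl is not available here.
 */
static long
cacheline_size_hint(void)
{
#if defined(__FreeBSD__)
	int value;	/* assumption: the sysctl is exported as an int */
	size_t len = sizeof(value);

	if (sysctlbyname("machdep.cacheline_size", &value, &len, NULL, 0) != 0)
		return (-1);
	if (len != sizeof(value) || value <= 0)
		return (-1);
	return (value);
#else
	/* Not FreeBSD: no such sysctl assumed. */
	return (-1);
#endif
}
```

An MI API as suggested above would presumably hide this kind of per-platform branching behind one call.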
-Nathan
David Chisnall
2018-01-04 10:03:32 UTC
Post by Konstantin Belousov
Post by Konstantin Belousov
On x86, the CPUID instruction leaf 0x1 returns the information in
%ebx register.
Hm, weird. Why don't we extend sysctl to include this info?
For the same reason we do not provide a sysctl to add two integers.
I strongly agree with Kostik on this one. Why add stuff to the kernel,
if userspace is already capable of extracting this? Adding that stuff
to sysctl has the downside that it will effectively introduce yet
another FreeBSDism, whereas something generic already exists.
Well, kind of. The userspace version is platform-dependent and not always available: for example, on PPC, you can't do this from userland and we provide a sysctl machdep.cacheline_size to userland. It would be nice to have an MI API.
On ARMv8, similarly, sometimes the kernel needs to advertise the wrong size. A few big.LITTLE cores have 64-byte cache lines on one cluster and 32-byte on the other. If you query the size from userspace while running on a 64-byte cluster, then issue the zero-cache-line instruction while migrated to the 32-byte cluster, you only clear half the size. Linux works around this by trapping and emulating the instruction to query the cache size and always reporting the size for the smallest cache lines. ARM tells people not to build systems like this, but it doesn’t always stop them. Trapping and emulating is much slower than just providing the information in a shared page, elf aux args vector, or even (often) a system call.

To give another example, Linux provides a very cheap way for a userspace process to enquire which core it’s running on. Some more recent high-performance mallocs use this to have a second-layer per-core cache after the per-thread cache for free blocks. Unlike the per-thread cache, the per-core cache does need a lock, but it’s very unlikely to be contended (it will only be contended if either a thread is migrated in between checking and locking, so acquires the wrong CPU’s lock, or if a thread is preempted in the middle of the very brief fill operation). The author of the SuperMalloc paper tried doing this with CPUID and found that it was slower by a sufficient margin to almost entirely offset the benefits of the extra layer of caching.

Just because userspace can get at the information directly from the hardware doesn’t mean that this is the most efficient or best way for userspace to get at it.

Oh, and some of these things are useful in portable code, so having to write some assembly for every target to get information that the kernel already knows is wasteful.

David
Konstantin Belousov
2018-01-04 18:29:40 UTC
Post by Konstantin Belousov
Post by Konstantin Belousov
On x86, the CPUID instruction leaf 0x1 returns the information in
%ebx register.
Hm, weird. Why don't we extend sysctl to include this info?
For the same reason we do not provide a sysctl to add two integers.
I strongly agree with Kostik on this one. Why add stuff to the kernel,
if userspace is already capable of extracting this? Adding that stuff
to sysctl has the downside that it will effectively introduce yet
another FreeBSDism, whereas something generic already exists.
Well, kind of. The userspace version is platform-dependent and not always available: for example, on PPC, you can't do this from userland and we provide a sysctl machdep.cacheline_size to userland. It would be nice to have an MI API.
On ARMv8, similarly, sometimes the kernel needs to advertise the wrong size. A few big.LITTLE cores have 64-byte cache lines on one cluster and 32-byte on the other. If you query the size from userspace while running on a 64-byte cluster, then issue the zero-cache-line instruction while migrated to the 32-byte cluster, you only clear half the size. Linux works around this by trapping and emulating the instruction to query the cache size and always reporting the size for the smallest cache lines. ARM tells people not to build systems like this, but it doesn’t always stop them. Trapping and emulating is much slower than just providing the information in a shared page, elf aux args vector, or even (often) a system call.
Of course the MD way is the best way to get such information, simply because
the meaning of 'cache line size' exists only in the context of a given CPU
(micro)architecture. For instance, on PowerPC and ARM you are often concerned
with the granularity of the instruction cache flush, but you might also be
concerned with DMA, and these are different concepts of cache.

Even on x86, you may care about alignment to avoid false sharing or
about CLFLUSH granularity, and these can be different legitimately.
Which one to report as 'cache line' ?

And you cannot bail out with the max among all constants, because sometimes
you really need the min size (for CLFLUSH), and sometimes the max size (for
false sharing).
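
One way to picture the min/max distinction is an interface that reports two numbers instead of one. The sketch below is purely illustrative: the struct, its field names, and both helpers are hypothetical, and it assumes both granules are powers of two.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical description of the two sizes a program may care about. */
struct cache_geometry {
	size_t flush_granule;	/* smallest line size: step for flush loops */
	size_t sharing_granule;	/* largest line size: padding against false sharing */
};

/* Number of flush operations needed to cover [addr, addr + len). */
static size_t
flush_steps(const struct cache_geometry *g, uintptr_t addr, size_t len)
{
	uintptr_t mask, first, last;

	if (len == 0)
		return (0);
	mask = ~(uintptr_t)(g->flush_granule - 1);
	first = addr & mask;			/* line holding the first byte */
	last = (addr + len - 1) & mask;		/* line holding the last byte */
	return ((size_t)((last - first) / g->flush_granule) + 1);
}

/* Round an object size up so neighbouring objects never share a line. */
static size_t
padded_size(const struct cache_geometry *g, size_t size)
{
	size_t n = g->sharing_granule;

	return ((size + n - 1) / n * n);
}
```

With a made-up geometry of 32-byte flush granule and 64-byte sharing granule, a 64-byte buffer needs two flush steps while an 8-byte object is padded to 64 bytes. Using a single number for both purposes would get one of the two wrong.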
To give another example, Linux provides a very cheap way for a userspace process to enquire which core it’s running on. Some more recent high-performance mallocs use this to have a second-layer per-core cache after the per-thread cache for free blocks. Unlike the per-thread cache, the per-core cache does need a lock, but it’s very unlikely to be contended (it will only be contended if either a thread is migrated in between checking and locking, so acquires the wrong CPU’s lock, or if a thread is preempted in the middle of the very brief fill operation). The author of the SuperMalloc paper tried doing this with CPUID and found that it was slower by a sufficient margin to almost entirely offset the benefits of the extra layer of caching.
There, RDTSCP is the intended way to get cpu id in userspace, but the use
of this instruction requires some minimal OS support. It should be faster
than CPUID, since it is not fully serializing. We do not support it only
because nobody asked so far.
Just because userspace can get at the information directly from the hardware doesn’t mean that this is the most efficient or best way for userspace to get at it.
It depends, but a single instruction (!) vs. syscall comparison makes this
discussion silly.
Oh, and some of these things are useful in portable code, so having to write some assembly for every target to get information that the kernel already knows is wasteful.
The required work is to define these interfaces; then they can be
implemented in the best way for each architecture. But nobody has done
that.
blubee blubeeme
2018-01-04 18:38:12 UTC
David Chisnall
2018-01-05 09:39:50 UTC
This idea of Arm big.LITTLE systems having cache lines of different lengths really, really bothers me - how on earth is the cache coherency supposed to work in such a system? I doubt the usual cache coherency protocols would work - probably need a really MESSY protocol to deal with this config :-)
I believe that the systems that have different cache line sizes (which ARM explicitly tells partners not to do) don’t allow cores from both the big and little clusters to be active at the same time: the OS is supposed to migrate everything entirely from one cluster to the other. The more complex designs that I’m aware of, which allow mixing cores from two or three different clusters, all have the same cache line size.

David
