Post by Andriy Gapon
Post by Bryan Drewery
Post by Peter Jeremy
Post by Jeff Roberson
Also, if you could try going back to r328953 or r326346 and let me know if
the problem exists in either. That would be very helpful. If anyone is
willing to debug this with me contact me directly and I will send some
test patches or debugging info after you have done the above steps.
I ran into this on 11-stable and tracked it to r326619 (MFC of r325851).
I initially got around the problem by reverting that commit but either
it or something very similar is still present in 11-stable r331053.
I've seen it on my main server (32GB RAM) but haven't managed to reproduce
it in smaller VBox guests - one difficulty I faced was artificially filling
the ARC.
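For anyone trying to reproduce this in a small VM, one rough way to fill the
ARC artificially is to stream reads over a data set larger than the configured
ARC maximum. A minimal sketch, assuming a ZFS dataset mounted at /tank/test
(the path and sizes are examples only, not taken from this thread):

    # /dev/random keeps compression from shrinking the on-disk data
    dd if=/dev/random of=/tank/test/bigfile bs=1m count=8192    # ~8 GB
    # read it back a few times so the blocks land in (and stay in) the ARC
    for i in 1 2 3; do dd if=/tank/test/bigfile of=/dev/null bs=1m; done
    sysctl kstat.zfs.misc.arcstats.size    # watch the ARC grow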
First, it looks like maybe several different issues are being discussed and
possibly conflated in this thread. I see reports related to ZFS, I see reports
where ZFS is not used at all. Some people report problems that appeared very
recently while others chime in with "yes, yes, I've always had this problem".
This does not help to differentiate between the problems or to analyze them.
Post by Bryan Drewery
Looking at the ARC change you referred to from r325851
https://reviews.freebsd.org/D12163, I am convinced that ARC backpressure
is completely broken.
Does your being convinced come from the code review or from experiments?
If the former, could you please share your analysis?
Post by Bryan Drewery
On my 78GB RAM system with ARC limited to 40GB and
doing a poudriere build of all LLVM and GCC packages at once in tmpfs I
can get swap up near 50GB and yet the ARC remains at 40GB through it
all. It's always been slow to give up memory for package builds but it
really seems broken right now.
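(For context, an ARC cap like the 40GB one described above is normally set via
the vfs.zfs.arc_max loader tunable; the value below is only an illustration,
expressed in bytes.)

    # /boot/loader.conf
    vfs.zfs.arc_max="42949672960"    # 40 GiB in bytes (example value)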
Well, there are multiple angles. Maybe it's the ARC that does not react
properly, or maybe it is not being asked properly (by the VM).
It would be useful to monitor the system during its transition to the state that
you reported. There are some interesting DTrace probes in ARC, specifically
arc-available_memory and arc-needfree are the first that come to mind. Their
parameters and how frequently they are called are of interest. VM parameters
could be of interest as well.
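A minimal DTrace sketch along those lines, assuming the probes show up under
the sdt provider (confirm the exact probe spec first with "dtrace -l | grep
arc-"; the argument meanings below are an assumption on my part):

    dtrace -q -n '
    sdt:::arc-available_memory
    {
        /* assumed: arg0 = lowest free memory seen, arg1 = reason index */
        printf("%Y available_memory %d reason %d\n",
            walltimestamp, (long long)arg0, (int)arg1);
    }
    sdt:::arc-needfree
    {
        printf("%Y needfree\n", walltimestamp);
    }'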
A rant.
Basically, posting some numbers and jumping to conclusions does not help at all.
Monitoring, graphing, etc. does help.
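Even a crude once-a-second log of ARC size against free and wired memory gives
something to graph. A sketch using stock sysctls (note that the VM counters
are in pages, not bytes):

    while :; do
        printf '%s ' "$(date +%T)" \
            "$(sysctl -n kstat.zfs.misc.arcstats.size)" \
            "$(sysctl -n vm.stats.vm.v_free_count)" \
            "$(sysctl -n vm.stats.vm.v_wire_count)"
        printf '\n'
        sleep 1
    done > arc-vm.log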
ARC is a complex dynamic system.
VM (pagedaemon, UMA caches) is a complex dynamic system.
They interact in complex, dynamic ways.
Sometimes a change in ARC is incorrect and requires a fix.
Sometimes a change in VM is incorrect and requires a fix.
Sometimes a change in VM requires a change in ARC.
These three kinds of problems can all appear as a "problem with ARC".
For instance, when vm.lowmem_period was introduced you wouldn't find any mention
of ZFS/ARC, but it does affect the period between arc_lowmem() calls.
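That tunable can be inspected, and for experiments shortened, at runtime; the
value below is only an example:

    sysctl vm.lowmem_period        # 10 seconds by default
    sysctl vm.lowmem_period=5      # example: fire low-memory events more often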
Also, pin-pointing a specific commit requires proper bisecting and proper
testing to correctly attribute systemic behavior changes to code changes.
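For the record, a bisection step is not much more than rebuilding at a chosen
revision and re-running the workload; a sketch assuming a Subversion checkout
in /usr/src (the revision shown is just one of the candidates mentioned above):

    svnlite update -r 328953 /usr/src
    cd /usr/src && make -j8 buildkernel KERNCONF=GENERIC
    make installkernel KERNCONF=GENERIC && shutdown -r now
    # re-run the workload, note whether the problem appears,
    # then halve the revision range and repeat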
I just upgraded my main package build box (12.0-CURRENT, 8 cores, 32 GB
RAM) from r327616 to r331716. I was seeing higher swap usage and larger
ARC sizes before the upgrade than I remember from the distant past, but
ARC was still at least somewhat responsive to memory pressure and I
didn't notice any performance issues.
After the upgrade, ARC size seems to be pretty unresponsive to memory
demand. Currently the machine is near the end of a poudriere run to
build my usual set of ~1800 ports. The only currently running build is
chromium and the machine is paging heavily. Settings of interest are:
USE_TMPFS="wrkdir data localbase"
ALLOW_MAKE_JOBS=yes
last pid: 96239; load averages: 1.86, 1.76, 1.83 up 3+14:47:00 12:38:11
108 processes: 3 running, 105 sleeping
CPU: 18.6% user, 0.0% nice, 2.4% system, 0.0% interrupt, 79.0% idle
Mem: 129M Active, 865M Inact, 61M Laundry, 29G Wired, 1553K Buf, 888M Free
ARC: 23G Total, 8466M MFU, 10G MRU, 5728K Anon, 611M Header, 3886M Other
17G Compressed, 32G Uncompressed, 1.88:1 Ratio
Swap: 40G Total, 17G Used, 23G Free, 42% Inuse, 4756K In
PID USERNAME THR PRI NICE SIZE RES STATE C TIME WCPU COMMAN
96239 nobody 1 76 0 140M 93636K CPU5 5 0:01 82.90% clang-
96238 nobody 1 75 0 140M 92608K CPU7 7 0:01 80.81% clang-
5148 nobody 1 20 0 590M 113M swread 0 0:31 0.29% clang-
57290 root 1 20 0 12128K 2608K zio->i 7 8:11 0.28% find
78958 nobody 1 20 0 838M 299M swread 0 0:23 0.19% clang-
97840 nobody 1 20 0 698M 140M swread 4 0:27 0.13% clang-
96066 nobody 1 20 0 463M 104M swread 1 0:32 0.12% clang-
11050 nobody 1 20 0 892M 154M swread 2 0:39 0.12% clang-
Pre-upgrade I was running r327616, which is newer than either of the
commits that Jeff mentioned above. It seems like there has been a
regression since then.
I also don't recall seeing this problem on my Ryzen box, though it has
2x the core count and 2x the RAM. The last testing that I did on it was
with r329844.