Discussion:
Strange ARC/Swap/CPU on yesterday's -CURRENT
Larry Rosenman
2018-03-05 20:39:18 UTC
Permalink
Upgraded to:

FreeBSD borg.lerctr.org 12.0-CURRENT FreeBSD 12.0-CURRENT #11 r330385: Sun Mar 4 12:48:52 CST 2018 ***@borg.lerctr.org:/usr/obj/usr/src/amd64.amd64/sys/VT-LER amd64
+1200060 1200060

Yesterday, and I'm seeing really strange slowness, ARC use, and SWAP use and swapping.

See http://www.lerctr.org/~ler/FreeBSD/Swapuse.png

Ideas?
--
Larry Rosenman http://www.lerctr.org/~ler
Phone: +1 214-642-9640 E-Mail: ***@lerctr.org
US Mail: 5708 Sabbia Drive, Round Rock, TX 78665-2106
Trond Endrestøl
2018-03-06 08:18:29 UTC
Post by Larry Rosenman
+1200060 1200060
Yesterday, and I'm seeing really strange slowness, ARC use, and SWAP use and swapping.
See http://www.lerctr.org/~ler/FreeBSD/Swapuse.png
I see these symptoms on stable/11. One of my servers has 32 GiB of
RAM. After a reboot all is well. ARC starts to fill up, and I still
have more than half of the memory available for user processes.

After running the periodic jobs at night, the amount of wired memory
goes sky high. /etc/periodic/weekly/310.locate is a particularly nasty
one.

Limiting the ARC to, say, 16 GiB, has no effect on the high amount of
wired memory. After a few more days, the kernel consumes virtually all
memory, forcing processes in and out of the swap device.

stable/10 never exhibited these symptoms, even with ZFS.

I had hoped the kernel would manage its memory usage more wisely, but
maybe it's time to set some hard limits on the kernel.

Last year, I experienced deadlocks on stable/11 systems running ZFS
with only 1 GiB of RAM. periodic(8) and clang jobs would never be
rescheduled; they just sat there doing nothing halfway through their
mission, with most of their pages on the swap device. I was lucky
enough to be able to log in and reboot the damned servers. I installed
8 GiB of memory in each server and have never seen any deadlocks since.

Maybe we should try to help by running (virtual) machines with low
amounts of memory and high loads to weed out these bugs, if they still
persist.
--
Trond.
Stefan Esser
2018-03-06 10:04:46 UTC
Post by Larry Rosenman
+1200060 1200060
Yesterday, and I'm seeing really strange slowness, ARC use, and SWAP use and swapping.
See http://www.lerctr.org/~ler/FreeBSD/Swapuse.png
Ideas?
I'm seeing the same, and currently work around this with a reasonably limited
vfs.zfs.arc_max.

Without such a limit I see (on a system with 24 GB RAM):

CPU: 0.3% user, 0.0% nice, 0.9% system, 0.1% interrupt, 98.8% idle
Mem: 14M Active, 1228K Inact, 32K Laundry, 23G Wired, 376M Free
ARC: 19G Total, 3935M MFU, 14G MRU, 82M Anon, 223M Header, 876M Other
18G Compressed, 36G Uncompressed, 2.02:1 Ratio
Swap: 24G Total, 888M Used, 23G Free, 3% Inuse, 8892K In, 5136K Out

sysctl vfs.zfs.arc_max=15988656640 results in:

Mem: 129M Active, 72M Inact, 36K Laundry, 18G Wired, 5149M Free
ARC: 15G Total, 3997M MFU, 10G MRU, 40M Anon, 205M Header, 877M Other
13G Compressed, 28G Uncompressed, 2.08:1 Ratio
Swap: 24G Total, 796M Used, 23G Free, 3% Inuse, 16K In

The system was mostly idle at both times, just some Samba traffic and
mail being checked by spamassassin. And I noticed it (this time) when
the spamassassin processes were aborted due to a time limit.

I think that this problem must have been introduced in the last few
weeks, but cannot give a better estimate (I do not reboot that often).

But I had already applied the arc_max setting a week ago (and had not
put it in sysctl.conf in the hope that the ARC growth was a temporary
problem in the ZFS code, soon to be fixed ...).
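(For reference, a sketch of making such a limit persistent across reboots -
the value is the one I used above, and vfs.zfs.arc_max is also accepted as a
loader tunable:)

# /boot/loader.conf
vfs.zfs.arc_max="15988656640"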

Regards, STefan
Rodney W. Grimes
2018-03-06 16:40:10 UTC
Post by Trond Endrestøl
Post by Larry Rosenman
+1200060 1200060
Yesterday, and I'm seeing really strange slowness, ARC use, and SWAP use and swapping.
See http://www.lerctr.org/~ler/FreeBSD/Swapuse.png
I see these symptoms on stable/11. One of my servers has 32 GiB of
RAM. After a reboot all is well. ARC starts to fill up, and I still
have more than half of the memory available for user processes.
After running the periodic jobs at night, the amount of wired memory
goes sky high. /etc/periodic/weekly/310.locate is a particularly nasty
one.
I would like to find out whether you are the same person who has been
reporting this problem through another channel, or whether this is
a confirmation of a bug I was helping someone else with.

Have you been in contact with Michael Dexter about this
issue, or any other forum/mailing list/etc?

If not, then we have at least 2 reports of this unbounded
wired memory growth; if so, hopefully someone here can
take you further in the debugging than we have been able
to get.
Post by Trond Endrestøl
Limiting the ARC to, say, 16 GiB, has no effect on the high amount of
wired memory. After a few more days, the kernel consumes virtually all
memory, forcing processes in and out of the swap device.
Our experience as well.

...

Thanks,
--
Rod Grimes ***@freebsd.org
Larry Rosenman
2018-03-06 17:34:56 UTC
Post by Rodney W. Grimes
...
Have you been in contact with Michael Dexter about this
issue, or any other forum/mailing list/etc?
Just IRC/Slack, with no response.
Post by Rodney W. Grimes
If not, then we have at least 2 reports of this unbounded
wired memory growth; if so, hopefully someone here can
take you further in the debugging than we have been able
to get.
What can I provide? The system is still in this state as the full backup is slow.
--
Larry Rosenman http://www.lerctr.org/~ler
Phone: +1 214-642-9640 E-Mail: ***@lerctr.org
US Mail: 5708 Sabbia Drive, Round Rock, TX 78665-2106
Trond Endrestøl
2018-03-07 07:07:34 UTC
Post by Rodney W. Grimes
...
Have you been in contact with Michael Dexter about this
issue, or any other forum/mailing list/etc?
No, it wasn't me.
--
Trond.
Rodney W. Grimes
2018-03-06 18:16:36 UTC
Post by Larry Rosenman
...
What can I provide? The system is still in this state as the full backup is slow.
One place to look is to see if this is the recently fixed:
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=222288
g_bio leak.

vmstat -z | egrep 'ITEM|g_bio|UMA'

would be a good first look
--
Rod Grimes ***@freebsd.org
Larry Rosenman
2018-03-06 19:36:45 UTC
Post by Rodney W. Grimes
...
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=222288
g_bio leak.
vmstat -z | egrep 'ITEM|g_bio|UMA'
would be a good first look
borg.lerctr.org /home/ler $ vmstat -z | egrep 'ITEM|g_bio|UMA'
ITEM SIZE LIMIT USED FREE REQ FAIL SLEEP
UMA Kegs: 280, 0, 346, 5, 560, 0, 0
UMA Zones: 1928, 0, 363, 1, 577, 0, 0
UMA Slabs: 112, 0,25384098, 977762,102033225, 0, 0
UMA Hash: 256, 0, 59, 16, 105, 0, 0
g_bio: 384, 0, 33, 1627,542482056, 0, 0
borg.lerctr.org /home/ler $
--
Larry Rosenman http://www.lerctr.org/~ler
Phone: +1 214-642-9640 E-Mail: ***@lerctr.org
US Mail: 5708 Sabbia Drive, Round Rock, TX 78665-2106
Danilo G. Baio
2018-03-06 22:15:54 UTC
Post by Larry Rosenman
...
Hi.

I noticed this behavior as well and changed vfs.zfs.arc_max to a smaller size.

For me it started when I upgraded to 1200058; on this box I'm only using
poudriere for test builds.

Regards.
--
Danilo G. Baio (dbaio)
Roman Bogorodskiy
2018-03-07 10:39:13 UTC
Post by Danilo G. Baio
...
I noticed this behavior as well and changed vfs.zfs.arc_max to a smaller size.
For me it started when I upgraded to 1200058; on this box I'm only using
poudriere for test builds.
I've noticed that as well.

I have 16G of RAM and two disks; the first one is UFS with the system
installation, and the second one is ZFS, which I use to store media and
data files and for poudriere.

I don't recall the exact date, but it started fairly recently. The system
would swap like crazy to the point where I could not even ssh to it, and
could hardly log in through the tty: it might take 10-15 minutes to see a
command typed in the shell.

I've updated loader.conf to have the following:

vfs.zfs.arc_max="4G"
vfs.zfs.prefetch_disable="1"

It fixed the problem, but introduced a new one. When I'm building stuff
with poudriere with ccache enabled, it takes hours to build even small
projects like curl or gnutls.

For example, current build:

[10i386-default] [2018-03-07_07h44m45s] [parallel_build:] Queued: 3 Built: 1 Failed: 0 Skipped: 0 Ignored: 0 Tobuild: 2 Time: 06:48:35
[02]: security/gnutls | gnutls-3.5.18 build (06:47:51)

Almost 7 hours already and still going!

gstat output looks like this:

dT: 1.002s w: 1.000s
L(q) ops/s r/s kBps ms/r w/s kBps ms/w %busy Name
0 0 0 0 0.0 0 0 0.0 0.0 da0
0 1 0 0 0.0 1 128 0.7 0.1 ada0
1 106 106 439 64.6 0 0 0.0 98.8 ada1
0 1 0 0 0.0 1 128 0.7 0.1 ada0s1
0 0 0 0 0.0 0 0 0.0 0.0 ada0s1a
0 0 0 0 0.0 0 0 0.0 0.0 ada0s1b
0 1 0 0 0.0 1 128 0.7 0.1 ada0s1d

ada0 here is the UFS drive, and ada1 is ZFS.
Roman Bogorodskiy
O. Hartmann
2018-03-10 23:47:10 UTC
Am Wed, 7 Mar 2018 14:39:13 +0400
Post by Roman Bogorodskiy
...
This is from an APU (no ZFS, UFS on a small mSATA device); the APU
(PC Engines) works as a firewall, router, and PBX:

last pid: 9665;  load averages: 0.13, 0.13, 0.11    up 3+06:53:55  00:26:26
19 processes: 1 running, 18 sleeping
CPU: 0.3% user, 0.0% nice, 0.2% system, 0.0% interrupt, 99.5% idle
Mem: 27M Active, 6200K Inact, 83M Laundry, 185M Wired, 128K Buf, 675M Free
Swap: 7808M Total, 2856K Used, 7805M Free
[...]

The APU is running CURRENT (FreeBSD 12.0-CURRENT #42 r330608: Wed Mar 7
16:55:59 CET 2018 amd64). Usually the APU never(!) uses swap; now it has been
swapping like hell for a couple of days and I have to reboot it fairly often.

Another box (16 GB RAM, ZFS, poudriere, the packaging box) is right now
unresponsive: after hours of building packages, I tried to copy the
repository from one location on the same ZFS volume to another - usually this
task takes a couple of minutes for ~2200 ports. Now it has taken 2 1/2 hours
and the box got stuck; Ctrl-T on the console delivers:
load: 0.00 cmd: make 91199 [pfault] 7239.56r 0.03u 0.04s 0% 740k

No response from the box anymore.


The problem of heavy swapping and slow performance isn't an issue of just the
past few days; it has been present for at least 1 1/2 weeks now, maybe longer.
Since I build ports fairly often, the time taken on that specific box has
increased from 2 to 3 days for all ~2200 ports. The system has 16 GB of RAM
and an IvyBridge 4-core Xeon at 3.4 GHz, if this information matters. The box
is consuming swap really fast.

Today is the first time the machine has become unresponsive (no ssh, no
console login so far). It needs a cold start. The OS is CURRENT as well.

Regards,

O. Hartmann
--
O. Hartmann

I object to the use or transmission of my data for advertising purposes or
for market or opinion research (§ 28 Abs. 4 BDSG).
Michael Gmelin
2018-03-11 00:40:58 UTC
Post by O. Hartmann
...
Today is the first time the machine has become unresponsive (no ssh, no
console login so far). It needs a cold start. The OS is CURRENT as well.
Any chance this is related to meltdown/spectre mitigation patches?

Best,
Michael
Jeff Roberson
2018-03-11 20:43:58 UTC
Post by O. Hartmann
...
Hi Folks,

This could be my fault from recent NUMA and concurrency-related work. I
did touch some of the ARC back-pressure mechanisms. First, I would like
to identify whether the wired memory is in the buffer cache. Can those of
you that have a repro look at sysctl vfs.bufspace and tell me if that
accounts for the bulk of your wired memory usage? I'm wondering if a job
ran that pulled in all of the bufs from your root disk and filled up the
buffer cache, which doesn't have a back-pressure mechanism, and the ARC
then didn't respond appropriately by lowering its usage.
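For example, a quick way to compare the two (standard sysctls; the arithmetic
just converts the wired page count into bytes):

sysctl vfs.bufspace
echo $(( $(sysctl -n vm.stats.vm.v_wire_count) * $(sysctl -n hw.pagesize) ))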

Also, if you could try going back to r328953 or r326346 and let me know if
the problem exists in either. That would be very helpful. If anyone is
willing to debug this with me, contact me directly and I will send some
test patches or debugging info after you have done the above steps.
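For anyone attempting that, a minimal sketch of stepping a source tree back to
one of those revisions (assuming an svn checkout of head in /usr/src and a
GENERIC kernel; if the KBI changed in between, a full buildworld may be needed
as well):

cd /usr/src
svnlite update -r 328953   # or -r 326346
make -j8 buildkernel KERNCONF=GENERIC
make installkernel KERNCONF=GENERIC
shutdown -r now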

Thank you for the reports.

Jeff
Matthew D. Fuller
2018-03-11 22:13:13 UTC
On Sun, Mar 11, 2018 at 10:43:58AM -1000 I heard the voice of Jeff Roberson:
First, I would like to identify whether the wired memory is in the
buffer cache. Can those of you that have a repro look at sysctl
vfs.bufspace and tell me if that accounts for the bulk of your wired
memory usage? I'm wondering if a job ran that pulled in all of the
bufs from your root disk and filled up the buffer cache which
doesn't have a back-pressure mechanism.
If by "root disk", you mean the one that isn't ZFS, that wouldn't
touch anything here; apart from a md-backed UFS /tmp and some NFS
mounts, everything on my system is ZFS.

I believe vfs.bufspace is what shows up as "Buf" on top? I don't
recall it looking particularly interesting when things were madly
swapping. I'll uncork arc_max again for a bit and see if anything odd
shows up in it, but it's only a dozen megs or so now.
--
Matthew Fuller (MF4839) | ***@over-yonder.net
Systems/Network Administrator | http://www.over-yonder.net/~fullermd/
On the Internet, nobody can hear you scream.
Jeff Roberson
2018-03-12 22:18:33 UTC
Post by Matthew D. Fuller
...
If by "root disk", you mean the one that isn't ZFS, that wouldn't
touch anything here; apart from a md-backed UFS /tmp and some NFS
mounts, everything on my system is ZFS.
I believe vfs.bufspace is what shows up as "Buf" on top? I don't
recall it looking particularly interesting when things were madly
swapping. I'll uncork arc_max again for a bit and see if anything odd
shows up in it, but it's only a dozen megs or so now.
You are right. I forgot that it was in top and didn't notice.

What I believe I need most is for someone to bisect a few revisions to let
me know if it was one of my two major patches.

Thanks,
Jeff
Tom Rushworth
2018-03-11 23:19:05 UTC
Hi All,

On 11/03/2018 13:43, Jeff Roberson wrote:
[snip]
Post by Jeff Roberson
Hi Folks,
This could be my fault from recent NUMA and concurrency-related work. ... Can
those of you that have a repro look at sysctl vfs.bufspace and tell me if that
accounts for the bulk of your wired memory usage? ...
Also, if you could try going back to r328953 or r326346 and let me know
if the problem exists in either. ...
Thank you for the reports.
Jeff
[snip]

I'm seeing this on 11.1-stable r330126 with 32G of memory. I have two
physical storage devices (one SSD, one HD), each a separate ZFS pool, and
I can reproduce this fairly easily and quickly with:

cp -r <dir_on_one_pool_with_lots_of_small_files> <dir_on_other_pool>

The directory being copied holds about 25G (from du -sg); I end up with
16G wired after starting with less than 1G. After the copy:
sysctl vfs.bufspace --> 0

Out of curiosity I copied it back the other way and drove the wired
memory to 26G during the copy, falling back to 24G once the copy
finished, with vfs.bufspace at 0.

I'm not really in a good position to roll back to r328953 (or anything
much earlier), my graphics HW (i915) needs something pretty recent.

I am running a custom kernel (I dropped a lot of the network
interfaces), so if you need more info I'm willing to help, as long as
you explain what you need in short words :). (I'm not very familiar
with FreeBSD kernel work or sysadmin.)

Regards,
--
Tom Rushworth
bob prohaska
2018-03-18 21:08:27 UTC
Post by Jeff Roberson
Also, if you could try going back to r328953 or r326346 and let me know if
the problem exists in either. That would be very helpful. ...
Not sure this is relevant, but r326343 is able to run a -j4 buildworld
to completion on an RPi3 with 3 gigs of microSD-based swap. There are
periods of seeming distress at times (lots of swread and pfault states,
along with high %idle) in top, but the compilation completes.

In contrast, r328953 could not complete buildworld using -j4. Buildworld
would stop, usually reporting c++ killed, apparently for want of swap,
even though swap usage never exceeded about 30% according to top.

The machine employs UFS filesystems, the kernel config can be seen at

http://www.zefox.net/~fbsd/rpi3/kernel_config/ZEFOX

Thanks for reading, I hope it's useful.

bob prohaska
Peter Jeremy
2018-03-20 07:07:45 UTC
Post by Jeff Roberson
Also, if you could try going back to r328953 or r326346 and let me know if
the problem exists in either. That would be very helpful. If anyone is
willing to debug this with me contact me directly and I will send some
test patches or debugging info after you have done the above steps.
I ran into this on 11-stable and tracked it to r326619 (MFC of r325851).
I initially got around the problem by reverting that commit but either
it or something very similar is still present in 11-stable r331053.

I've seen it on my main server (32GB RAM) but haven't managed to reproduce
it in smaller VBox guests - one difficulty I faced was artificially filling
the ARC.
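(One crude way to generate ARC pressure is simply to stream a large directory
tree through tar, e.g.:)

tar cf /dev/null /usr/ports/distfiles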
--
Peter Jeremy
Bryan Drewery
2018-03-23 23:21:32 UTC
Post by Peter Jeremy
...
I ran into this on 11-stable and tracked it to r326619 (MFC of r325851).
I initially got around the problem by reverting that commit but either
it or something very similar is still present in 11-stable r331053.
I've seen it in my main server (32GB RAM) but haven't managed to reproduce
it in smaller VBox guests - one difficulty I faced was artificially filling
ARC.
Looking at the ARC change you referred to from r325851
(https://reviews.freebsd.org/D12163), I am convinced that ARC backpressure
is completely broken. On my 78GB RAM system with the ARC limited to 40GB,
doing a poudriere build of all the LLVM and GCC packages at once in tmpfs, I
can get swap up near 50GB and yet the ARC remains at 40GB through it all.
It has always been slow to give up memory for package builds, but it really
seems broken right now.
--
Regards,
Bryan Drewery
Slawa Olhovchenkov
2018-03-25 10:41:58 UTC
Post by Bryan Drewery
...
Looking at the ARC change you referred to from r325851
https://reviews.freebsd.org/D12163, I am convinced that ARC backpressure
is completely broken. On my 78GB RAM system with ARC limited to 40GB and
doing a poudriere build of all LLVM and GCC packages at once in tmpfs I
can get swap up near 50GB and yet the ARC remains at 40GB through it
all. It's always been slow to give up memory for package builds but it
really seems broken right now.
Can you test reverting all of the 'needfree'-related commits and applying
https://reviews.freebsd.org/D7538 ?
Andriy Gapon
2018-03-27 14:00:09 UTC
Post by Bryan Drewery
Post by Peter Jeremy
...
I ran into this on 11-stable and tracked it to r326619 (MFC of r325851).
I initially got around the problem by reverting that commit but either
it or something very similar is still present in 11-stable r331053.
I've seen it in my main server (32GB RAM) but haven't managed to reproduce
it in smaller VBox guests - one difficulty I faced was artificially filling
ARC.
First, it looks like maybe several different issues are being discussed and
possibly conflated in this thread. I see reports related to ZFS, I see reports
where ZFS is not used at all. Some people report problems that appeared very
recently while others chime in with "yes, yes, I've always had this problem".
This does not help to differentiate between problems and to analyze them.
Post by Bryan Drewery
Looking at the ARC change you referred to from r325851
https://reviews.freebsd.org/D12163, I am convinced that ARC backpressure
is completely broken.
Does your being convinced come from the code review or from experiments?
If the former, could you please share your analysis?
Post by Bryan Drewery
On my 78GB RAM system with ARC limited to 40GB and
doing a poudriere build of all LLVM and GCC packages at once in tmpfs I
can get swap up near 50GB and yet the ARC remains at 40GB through it
all. It's always been slow to give up memory for package builds but it
really seems broken right now.
Well, there are multiple angles. Maybe it's ARC that does not react properly,
or maybe it's not being asked properly.

It would be useful to monitor the system during its transition to the state
that you reported. There are some interesting DTrace probes in ARC;
specifically, arc-available_memory and arc-needfree are the first that come
to mind. Their parameters and how frequently they are called are of interest.
VM parameters could be of interest as well.
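For example, something along these lines will show when and how often they
fire (the argument layout here is an assumption - verify it against the probe
definitions in arc.c):

dtrace -n 'sdt:::arc-available_memory { printf("%Y free=%d reason=%d", walltimestamp, arg0, arg1); }' \
       -n 'sdt:::arc-needfree { printf("%Y needfree=%d", walltimestamp, arg0); }'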

A rant.

Basically, posting some numbers and jumping to conclusions does not help at all.
Monitoring, graphing, etc does help.
ARC is a complex dynamic system.
VM (pagedaemon, UMA caches) is a complex dynamic system.
They interact in a complex dynamic ways.
Sometimes a change in ARC is incorrect and requires a fix.
Sometimes a change in VM is incorrect and requires a fix.
Sometimes a change in VM requires a change in ARC.
These three kinds of problems can all appear as a "problem with ARC".

For instance, when vm.lowmem_period was introduced you wouldn't find any
mention of ZFS/ARC. But it does affect the period between arc_lowmem() calls.
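(That knob is easy to inspect and experiment with; it is expressed in seconds,
if I remember correctly:)

sysctl vm.lowmem_period        # read the current value
sysctl vm.lowmem_period=5      # shorten the interval between lowmem events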

Also, pin-pointing a specific commit requires proper bisecting and proper
testing to correctly attribute systemic behavior changes to code changes.
--
Andriy Gapon
Don Lewis
2018-04-01 19:53:32 UTC
Post by Andriy Gapon
...
Also, pin-pointing a specific commit requires proper bisecting and proper
testing to correctly attribute systemic behavior changes to code changes.
I just upgraded my main package build box (12.0-CURRENT, 8 cores, 32 GB
RAM) from r327616 to r331716. I was seeing higher swap usage and larger
ARC sizes before the upgrade than I remember from the distant past, but
ARC was still at least somewhat responsive to memory pressure and I
didn't notice any performance issues.

After the upgrade, ARC size seems to be pretty unresponsive to memory
demand. Currently the machine is near the end of a poudriere run to
build my usual set of ~1800 ports. The only currently running build is
chromium and the machine is paging heavily. Settings of interest are:
USE_TMPFS="wrkdir data localbase"
ALLOW_MAKE_JOBS=yes

last pid: 96239; load averages: 1.86, 1.76, 1.83 up 3+14:47:00 12:38:11
108 processes: 3 running, 105 sleeping
CPU: 18.6% user, 0.0% nice, 2.4% system, 0.0% interrupt, 79.0% idle
Mem: 129M Active, 865M Inact, 61M Laundry, 29G Wired, 1553K Buf, 888M Free
ARC: 23G Total, 8466M MFU, 10G MRU, 5728K Anon, 611M Header, 3886M Other
17G Compressed, 32G Uncompressed, 1.88:1 Ratio
Swap: 40G Total, 17G Used, 23G Free, 42% Inuse, 4756K In

PID USERNAME THR PRI NICE SIZE RES STATE C TIME WCPU COMMAN
96239 nobody 1 76 0 140M 93636K CPU5 5 0:01 82.90% clang-
96238 nobody 1 75 0 140M 92608K CPU7 7 0:01 80.81% clang-
5148 nobody 1 20 0 590M 113M swread 0 0:31 0.29% clang-
57290 root 1 20 0 12128K 2608K zio->i 7 8:11 0.28% find
78958 nobody 1 20 0 838M 299M swread 0 0:23 0.19% clang-
97840 nobody 1 20 0 698M 140M swread 4 0:27 0.13% clang-
96066 nobody 1 20 0 463M 104M swread 1 0:32 0.12% clang-
11050 nobody 1 20 0 892M 154M swread 2 0:39 0.12% clang-

Pre-upgrade I was running r327616, which is newer than either of the
commits that Jeff mentioned above. It seems like there has been a
regression since then.

I also don't recall seeing this problem on my Ryzen box, though it has
2x the core count and 2x the RAM. The last testing that I did on it was
with r329844.
Don Lewis
2018-04-04 01:00:51 UTC
Post by Don Lewis
...
Pre-upgrade I was running r327616, which is newer than either of the
commits that Jeff mentioned above. It seems like there has been a
regression since then.
I also don't recall seeing this problem on my Ryzen box, though it has
2x the core count and 2x the RAM. The last testing that I did on it was
with r329844.
I reconfigured my Ryzen box to be more similar to my default package
builder by disabling SMT and half of the RAM, to limit it to 8 cores
and 32 GB and then started bisecting to try to track down the problem.
For each test, I first filled ARC by tarring /usr/ports/distfiles to
/dev/null. The commit range that I was searching was r329844 to
r331716. I narrowed the range to r329844 to r329904. With r329904
and newer, ARC is totally unresponsive to memory pressure and the
machine pages heavily. I see ARC sizes of 28-29GB and 30GB of wired
RAM, so there is not much leftover for getting useful work done. Active
memory and free memory both hover under 1GB each. Looking at the
commit logs over this range, the most likely culprit is:

r329882 | jeff | 2018-02-23 14:51:51 -0800 (Fri, 23 Feb 2018) | 13 lines

Add a generic Proportional Integral Derivative (PID) controller algorithm and
use it to regulate page daemon output.

This provides much smoother and more responsive page daemon output, anticipating
demand and avoiding pageout stalls by increasing the number of pages to match
the workload. This is a reimplementation of work done by myself and mlaier at
Isilon.


It is quite possible that the recent fixes to the PID controller will
fix the problem. Not that r329844 was trouble free ... I left tar
running over lunchtime to fill ARC and the OOM killer nuked top, tar,
ntpd, both of my ssh sessions into the machine, and multiple instances
of getty while I was away. I was able to log in again and successfully
run poudriere, and ARC did respond to the memory pressure and cranked
itself down to about 5 GB by the end of the run. I did not see the same
problem with tar when I did the same with r329904.
Don Lewis
2018-04-04 04:42:48 UTC
Permalink
Post by Don Lewis
[...]
It is quite possible that the recent fixes to the PID controller will
fix the problem. Not that r329844 was trouble free ... I left tar
running over lunchtime to fill ARC and the OOM killer nuked top, tar,
ntpd, both of my ssh sessions into the machine, and multiple instances
of getty while I was away. I was able to log in again and successfully
run poudriere, and ARC did respond to the memory pressure and cranked
itself down to about 5 GB by the end of the run. I did not see the same
problem with tar when I did the same with r329904.
I just tried r331966 and see no improvement. No OOM process kills
during the tar run to fill ARC, but with ARC filled, the machine is
thrashing itself at the start of the poudriere run while trying to build
ports-mgmt/pkg (39 minutes so far). ARC appears to be unresponsive to
memory demand. I've seen no decrease in ARC size or wired memory since
starting poudriere.

last pid: 75652; load averages: 0.46, 0.27, 0.25 up 0+02:03:48 21:33:00
48 processes: 1 running, 47 sleeping
CPU: 0.7% user, 0.0% nice, 0.8% system, 0.1% interrupt, 98.4% idle
Mem: 4196K Active, 328K Inact, 148K Laundry, 30G Wired, 506M Free
ARC: 29G Total, 309M MFU, 28G MRU, 3767K Anon, 103M Header, 79M Other
27G Compressed, 29G Uncompressed, 1.05:1 Ratio
Swap: 80G Total, 234M Used, 80G Free, 2348K In, 420K Out

PID USERNAME THR PRI NICE SIZE RES STATE C TIME WCPU COMMAND
1340 root 1 52 0 13048K 1024K nanslp 6 0:10 0.72% sh
41806 nobody 1 20 0 133M 1440K swread 6 0:17 0.31% cc
42554 nobody 1 20 0 129M 1048K swread 6 0:12 0.22% cc
934 dl 1 20 0 19124K 592K select 6 0:03 0.14% sshd
938 dl 1 20 0 13596K 740K CPU7 7 0:07 0.12% top
40784 nobody 1 20 0 5024K 104K select 5 0:00 0.03% make
41142 nobody 1 20 0 5024K 124K select 6 0:00 0.01% make
638 root 1 20 0 11344K 252K swread 7 0:00 0.01% syslogd
41428 nobody 1 20 0 5024K 124K select 4 0:00 0.01% make
40782 nobody 1 20 0 5024K 124K select 4 0:00 0.01% make
41164 nobody 1 20 0 5024K 124K select 7 0:00 0.01% make
1050 root 1 23 0 13032K 468K select 0 0:06 0.00% sh
848 root 1 20 0 11400K 492K nanslp 5 0:00 0.00% cron
36198 root 1 20 0 13048K 396K select 7 0:00 0.00% sh
Mark Johnston
2018-04-04 17:49:49 UTC
Permalink
Post by Don Lewis
[...]
I just tried r331966 and see no improvement. No OOM process kills
during the tar run to fill ARC, but with ARC filled, the machine is
thrashing itself at the start of the poudriere run while trying to build
ports-mgmt/pkg (39 minutes so far). ARC appears to be unresponsive to
memory demand. I've seen no decrease in ARC size or wired memory since
starting poudriere.
Re-reading the ARC reclaim code, I see a couple of issues which might be
at the root of the behaviour you're seeing.

1. zfs_arc_free_target is too low now. It is initialized to the page
daemon wakeup threshold, which is slightly above v_free_min. With the
PID controller, the page daemon uses a setpoint of v_free_target.
Moreover, it now wakes up regularly rather than having wakeups be
synchronized by a mutex, so it will respond quickly if the free page
count dips below v_free_target. The free page count will dip below
zfs_arc_free_target only in the face of sudden and extreme memory
pressure now, so the FMT_LOTSFREE case probably isn't getting
exercised. Try initializing zfs_arc_free_target to v_free_target.
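At runtime this could be tested with something along these lines (a
sketch, untested; both OIDs are page counts, so no unit conversion
should be needed):

sysctl vfs.zfs.arc_free_target="$(sysctl -n vm.v_free_target)"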

2. In the inactive queue scan, we used to compute the shortage after
running uma_reclaim() and the lowmem handlers (which includes a
synchronous call to arc_lowmem()). Now it's computed before, so we're
not taking into account the pages that get freed by the ARC and UMA.
The following rather hacky patch may help. I note that the lowmem
logic is now somewhat broken when multiple NUMA domains are
configured, however, since it fires only when domain 0 has a free
page shortage.

Index: sys/vm/vm_pageout.c
===================================================================
--- sys/vm/vm_pageout.c (revision 331933)
+++ sys/vm/vm_pageout.c (working copy)
@@ -1114,25 +1114,6 @@
 	boolean_t queue_locked;
 
 	/*
-	 * If we need to reclaim memory ask kernel caches to return
-	 * some. We rate limit to avoid thrashing.
-	 */
-	if (vmd == VM_DOMAIN(0) && pass > 0 &&
-	    (time_uptime - lowmem_uptime) >= lowmem_period) {
-		/*
-		 * Decrease registered cache sizes.
-		 */
-		SDT_PROBE0(vm, , , vm__lowmem_scan);
-		EVENTHANDLER_INVOKE(vm_lowmem, VM_LOW_PAGES);
-		/*
-		 * We do this explicitly after the caches have been
-		 * drained above.
-		 */
-		uma_reclaim();
-		lowmem_uptime = time_uptime;
-	}
-
-	/*
 	 * The addl_page_shortage is the number of temporarily
 	 * stuck pages in the inactive queue. In other words, the
 	 * number of pages from the inactive count that should be
@@ -1824,6 +1805,26 @@
 	atomic_store_int(&vmd->vmd_pageout_wanted, 1);
 
 	/*
+	 * If we need to reclaim memory ask kernel caches to return
+	 * some. We rate limit to avoid thrashing.
+	 */
+	if (vmd == VM_DOMAIN(0) &&
+	    vmd->vmd_free_count < vmd->vmd_free_target &&
+	    (time_uptime - lowmem_uptime) >= lowmem_period) {
+		/*
+		 * Decrease registered cache sizes.
+		 */
+		SDT_PROBE0(vm, , , vm__lowmem_scan);
+		EVENTHANDLER_INVOKE(vm_lowmem, VM_LOW_PAGES);
+		/*
+		 * We do this explicitly after the caches have been
+		 * drained above.
+		 */
+		uma_reclaim();
+		lowmem_uptime = time_uptime;
+	}
+
+	/*
 	 * Use the controller to calculate how many pages to free in
 	 * this interval.
 	 */
Don Lewis
2018-04-04 18:17:16 UTC
Permalink
Post by Mark Johnston
[...]
Re-reading the ARC reclaim code, I see a couple of issues which might be
at the root of the behaviour you're seeing.
1. zfs_arc_free_target is too low now. It is initialized to the page
daemon wakeup threshold, which is slightly above v_free_min. With the
PID controller, the page daemon uses a setpoint of v_free_target.
Moreover, it now wakes up regularly rather than having wakeups be
synchronized by a mutex, so it will respond quickly if the free page
count dips below v_free_target. The free page count will dip below
zfs_arc_free_target only in the face of sudden and extreme memory
pressure now, so the FMT_LOTSFREE case probably isn't getting
exercised. Try initializing zfs_arc_free_target to v_free_target.
Changing zfs_arc_free_target definitely helps. My previous poudriere
run failed when poudriere timed out the ports-mgmt/pkg build after two
hours. After changing this setting, poudriere seems to be running
properly and ARC has dropped from 29GB to 26GB ten minutes into the run
and I'm not seeing processes in the swread state.
Post by Mark Johnston
2. In the inactive queue scan, we used to compute the shortage after
running uma_reclaim() and the lowmem handlers (which includes a
synchronous call to arc_lowmem()). Now it's computed before, so we're
not taking into account the pages that get freed by the ARC and UMA.
The following rather hacky patch may help. I note that the lowmem
logic is now somewhat broken when multiple NUMA domains are
configured, however, since it fires only when domain 0 has a free
page shortage.
I will try this next.
Justin Hibbits
2018-04-06 00:47:14 UTC
Permalink
Post by Stefan Esser
[...]
My powerpc64 embedded machine is virtually unusable since these vm changes.
I tried setting vfs.zfs.arc_free_target as suggested, and that didn't help
at all. Eventually the machine hangs and just gets stuck in vmdaemon, with
many processes in wait channel btalloc.

- Justin
Mark Johnston
2018-04-06 15:08:14 UTC
Permalink
Post by Justin Hibbits
My powerpc64 embedded machine is virtually unusable since these vm changes.
I tried setting vfs.zfs.arc_free_target as suggested, and that didn't help
at all. Eventually the machine hangs and just gets stuck in vmdaemon, with
many processes in wait channel btalloc.
You don't really have the same symptoms that Don is reporting.
Threads being stuck in btalloc implies a KVA shortage. So:
- What do you mean by "stuck in vmdaemon"?
- Which commits specifically cause problems? Does reverting r329882 fix the
hang?
- Can you break to DDB and get "show page" output when the hang occurs?
- What is the system doing to cause the hang to occur?
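For reference, one way to capture that, assuming the kernel was built
with DDB and there is console access (a sketch, untested here):

sysctl debug.kdb.enter=1     # drop into the debugger from a root shell
db> show page
db> ps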
Justin Hibbits
2018-04-06 15:25:06 UTC
Permalink
Post by Mark Johnston
Post by Justin Hibbits
My powerpc64 embedded machine is virtually unusable since these vm changes.
I tried setting vfs.zfs.arc_free_target as suggested, and that didn't help
at all. Eventually the machine hangs and just gets stuck in vmdaemon, with
many processes in wait channel btalloc.
You don't really have the same symptoms that Don is reporting.
Okay. I latched onto the thread because it seemed similar.
Post by Mark Johnston
- What do you mean by "stuck in vmdaemon"?
The machine hangs, and my ssh sessions get killed. I can't do
anything at the console except break into kdb. When I do, the running
thread is always vmdaemon.
Post by Mark Johnston
- Which commits specifically cause problems? Does reverting r329882 fix the
hang?
I'll try reverting it and report back. Thankfully I can buildkernel
successfully on the machine before it hangs. Can't do more than that,
though.
Post by Mark Johnston
- Can you break to DDB and get "show page" output when the hang occurs?
I'll reproduce and get numbers today, but I do know the free_count was
high (6 digits), much higher than free_min, when I checked
yesterday. I'm surprised it's running out of KVA; I've never had the
problem before with the same workloads, and it has ~7.5GB of KVA (almost
the same size as the total RAM in the machine).
Post by Mark Johnston
- What is the system doing to cause the hang to occur?
Just a simple buildworld with 2 or 3 jobs (tried both). It's 100% reproducible.

- Justin
Justin Hibbits
2018-04-07 19:34:54 UTC
Permalink
Post by Justin Hibbits
[...]
Post by Mark Johnston
- Which commits specifically cause problems? Does reverting r329882 fix the
hang?
I'll try reverting it and report back. Thankfully I can buildkernel
successfully on the machine before it hangs. Can't do more than that,
though.
I reverted back to r329881, and successfully built world. Updated to
r329882 and it got stuck with processes in btalloc.

I've seen other reports of r329882 causing problems on powerpc64 as
well, with other bizarre behaviors.


- Justin
Mark Johnston
2018-04-07 19:49:35 UTC
Permalink
Post by Justin Hibbits
[...]
I reverted back to r329881, and successfully built world. Updated to
r329882 and it got stuck with processes in btalloc.
I've seen other reports of r329882 causing problems on powerpc64 as
well, with other bizarre behaviors.
Did you try the patch that I had posted? If not, could you? Please also
update zfs_arc_free_target or just apply D14994.
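For reference, one way to pull a Phabricator revision into a source tree,
assuming devel/arcanist is installed (a sketch only):

cd /usr/src && arc patch D14994

The raw diff can also be downloaded from the review page and applied with
patch(1).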
Don Lewis
2018-04-07 19:55:56 UTC
Permalink
Post by Justin Hibbits
[...]
I reverted back to r329881, and successfully built world. Updated to
r329882 and it got stuck with processes in btalloc.
I've seen other reports of r329882 causing problems on powerpc64 as
well, with other bizarre behaviors.
I initially missed that this was on powerpc. I believe that arch is a
bit odd with characters unsigned by default and arithmetic shifts
always being unsigned. Things like that could mess up the algorithm,
but I didn't see anything suspicious in the r329882 diff.
Justin Hibbits
2018-04-07 19:59:14 UTC
Permalink
Hi Don,
Post by Don Lewis
I initially missed that this was on powerpc. I believe that arch is a
bit odd with characters unsigned by default and arithmetic shifts
always being unsigned. Things like that could mess up the algorithm,
but I didn't see anything suspicious in the r329882 diff.
I think the biggest difference for powerpc64 vs amd64 in this case is
the (currently) relatively small KVA (7.5GB vs 2(?) TB). That might
make a difference regarding VM pageout calculations.

- Justin

Don Lewis
2018-04-06 17:33:19 UTC
Permalink
Post by Don Lewis
[...]
Post by Mark Johnston
2. In the inactive queue scan, we used to compute the shortage after
running uma_reclaim() and the lowmem handlers (which includes a
synchronous call to arc_lowmem()). Now it's computed before, so we're
not taking into account the pages that get freed by the ARC and UMA.
The following rather hacky patch may help. I note that the lowmem
logic is now somewhat broken when multiple NUMA domains are
configured, however, since it fires only when domain 0 has a free
page shortage.
I will try this next.
The patch by itself is not sufficient to fix the problem for me. I
didn't have any problems with using the patch as well as setting
zfs_arc_free_target. As a matter of fact, that was the only poudriere
run where I didn't have a guile-related build failure. Those tend to
be fairly random, so it could just be luck.

Performance-wise r331966 + zfs_arc_free_target completes the poudriere
run about 2.6% faster than r329844. But I don't know if this is the PID
controller or something else that changed in base over that interval.
Don Lewis
2018-04-06 17:33:26 UTC
Permalink
Post by Mark Johnston
[...]
Thomas Steen Rasmussen
2018-03-20 15:09:48 UTC
Permalink
Post by Jeff Roberson
Also, if you could try going back to r328953 or r326346 and let me
know if the problem exists in either.  That would be very helpful.  If
anyone is willing to debug this with me contact me directly and I will
send some test patches or debugging info after you have done the above
steps.
Hello,

I am seeing this issue (high swap, arc not backing down) on two
jail/bhyve hosts running 11-STABLE r325275 and r325235 - which sounds
like it is earlier than the two patches you mention?

The two machines are at 98 and 138 days uptime, and both are currently
using more than 90% swap, and I've had to shut down non-critical stuff
because I was getting out-of-swap errors.

Just wanted to let everyone know, since I haven't seen any revisions as
early as r325275 in the "me too" posts here.

More information available on request.

Best regards,

Thomas Steen Rasmussen
Matthew D. Fuller
2018-03-06 18:10:54 UTC
Permalink
On Mon, Mar 05, 2018 at 02:39:18PM -0600 I heard the voice of
Post by Larry Rosenman
Yesterday, and I'm seeing really strange slowness, ARC use, and SWAP use and swapping.
Ideas?
Since I updated to the Feb 25 -CURRENT I'm currently running (from a
mid-Sept build, I believe), I've seen similar behavior. It seems like the
ARC has gotten really unwilling to yield, so it grows in size and then
doesn't let up under pressure. I saw programs in constant active use
swapping their working set in and out, since they were left with tiny
available memory.

Hard-slapping vfs.zfs.arc_max down a ways mitigated it enough to get
me through the days, but is a pretty gross hackaround...
--
Matthew Fuller (MF4839) | ***@over-yonder.net
Systems/Network Administrator | http://www.over-yonder.net/~fullermd/
On the Internet, nobody can hear you scream.
Kurt Jaeger
2018-03-06 20:20:17 UTC
Permalink
Hi!
Post by Matthew D. Fuller
Post by Larry Rosenman
Yesterday, and I'm seeing really strange slowness, ARC use, and SWAP use and swapping.
Ideas?
[...]
Post by Matthew D. Fuller
Hard-slappping vfs.zfs.arc_max down a ways mitigated it enough to get
me through the days, but is a pretty gross hackaround...
That's what I did as well.
--
***@opsec.eu +49 171 3101372 2 years to go !
Mark Millard
2018-03-12 03:30:15 UTC
Permalink
As I understand, O. Hartmann's report ( ohartmann at walstatt.org ) in:

https://lists.freebsd.org/pipermail/freebsd-current/2018-March/068806.html
Post by O. Hartmann
This is from an APU, no ZFS, UFS on a small mSATA device; the APU (PCengine) works as a
last pid: 9665;  load averages: 0.13, 0.13, 0.11    up 3+06:53:55  00:26:26
19 processes: 1 running, 18 sleeping
CPU: 0.3% user, 0.0% nice, 0.2% system, 0.0% interrupt, 99.5% idle
Mem: 27M Active, 6200K Inact, 83M Laundry, 185M Wired, 128K Buf, 675M Free
Swap: 7808M Total, 2856K Used, 7805M Free
[...]
The APU is running CURRENT (FreeBSD 12.0-CURRENT #42 r330608: Wed Mar 7 16:55:59 CET
2018 amd64). Usually the APU never(!) uses swap; now it has been swapping like hell
for a couple of days and I have to reboot it fairly often.
Unless this is unrelated, it would suggest that ZFS and its ARC need not
be involved.

Would what you are investigating relative to your "NUMA and concurrency
related work" fit with such a non-ZFS (no-ARC) context?

===
Mark Millard
marklmi at yahoo.com
( dsl-only.net went
away in early 2018-Mar)
Jeff Roberson
2018-03-12 04:00:35 UTC
Permalink
Post by Mark Millard
[...]
Unless this is unrelated, it would suggest that ZFS and its ARC need not
be involved.
Would what you are investigating relative to your "NUMA and concurrency
related work" fit with such a non-ZFS (no-ARC) context?
I think there are probably two different bugs. I believe the pid
controller has caused the laundry thread to start being more aggressive
causing more pageouts which would cause increased swap consumption.

The back-pressure mechanisms in ARC should've resolved the other reports.
It's possible that I broke those. Although if the reports from 11.x are
to be believed I don't know that it was me. It is possible they have been
broken at different times for different reasons. So I will continue to
look.

Thanks,
Jeff
Post by Mark Millard
===
Mark Millard
marklmi at yahoo.com
( dsl-only.net went
away in early 2018-Mar)
O. Hartmann
2018-03-17 09:38:48 UTC
Permalink
On Sun, 11 Mar 2018 18:00:35 -1000 (HST)
Post by Jeff Roberson
Post by Mark Millard
https://lists.freebsd.org/pipermail/freebsd-current/2018-March/068806.html
Post by O. Hartmann
[...]
Yes, that is correct.

The system in question (the PCengine APU4C, artificially limited to 1 GB RAM via a boot
loader option) runs an Asterisk PBX and is our firewall/router appliance. The
kernel/world is highly customized and "reduced" (via NanoBSD WITHOUT_ build options).

Since it went into service in the middle of last year it has always run CURRENT. I had
only noticed that Asterisk might have a memory leak, although, as some past commits
suggest, that could also be triggered by a bug in syslog. That's the background for
having only 1 GB out of the 4 GB configured.

For a couple of weeks now, this APU has been swapping and keeping ~ 4GB of allocated
swap. Right now it is running CURRENT, FreeBSD 12.0-CURRENT #50 r330750: Sun Mar 11
01:14:34 CET 2018 amd64:

last pid: 16958;  load averages: 0.10, 0.21, 0.16    up 6+08:57:07  10:34:01
19 processes: 1 running, 18 sleeping
CPU: 0.3% user, 0.0% nice, 0.5% system, 0.0% interrupt, 99.2% idle
Mem: 27M Active, 1504K Inact, 96M Laundry, 188M Wired, 1900K Buf, 664M Free
Swap: 7808M Total, 4204K Used, 7804M Free

PID USERNAME THR PRI NICE SIZE RES STATE C TIME WCPU COMMAND
997 asterisk 59 52 0 133M 61220K select 0 278:59 2.69% asterisk
16958 root 1 20 0 13308K 3448K CPU2 2 0:00 0.14% top
579 root 1 20 0 15252K 3116K select 1 35:01 0.07% ppp
933 root 1 20 0 10892K 1688K select 0 1:38 0.02% powerd
1038 root 1 20 0 11400K 764K nanslp 1 0:03 0.02% cron
930 root 1 20 0 18200K 18280K select 0 0:47 0.01% ntpd
1005 root 1 20 0 14772K 4516K bpf 3 0:13 0.00% arpwatch
834 root 1 20 0 11364K 1992K select 1 4:46 0.00% syslogd
847 bind 7 52 0 59528K 30600K sigwai 2 1:30 0.00% named
989 root 1 20 0 32264K 1716K nanslp 2 0:20 0.00% perl
863 daemon 1 20 0 11388K 1944K select 1 0:01 0.00% rpcbind
872 root 1 20 0 11096K 1840K autofs 2 0:01 0.00% automountd
975 root 1 20 0 14548K 0K nanslp 3 0:00 0.00% <smartd>
968 dhcpd 1 20 0 22972K 7648K select 0 0:00 0.00% dhcpd
878 root 1 20 0 10988K 0K kqread 1 0:00 0.00% <autounmountd>
13271 root 1 20 0 12104K 3128K wait 1 0:00 0.00% login
16955 root 1 26 0 13204K 4080K pause 3 0:00 0.00% csh
1034 root 1 20 0 18340K 3976K select 0 0:00 0.00% sshd
1084 root 1 52 0 10928K 1724K ttyin 2 0:00 0.00% getty

This box doesn't have ZFS! There is a small mSATA device, UFS2-formatted,
for logging and automount usage.

I know this "repeated" report could annoy, but maybe it provides some more
information/confirmation by adding to the time series of incidents.
--
O. Hartmann

I object to the use or transmission of my data for
advertising purposes or for market or opinion research (§ 28 Abs. 4 BDSG).
Mark Millard
2018-03-17 16:51:31 UTC
Permalink
On 2018-Mar-17, at 2:38 AM, O. Hartmann <ohartmann at walstatt.org> wrote:

. . .
Post by O. Hartmann
[...]
I'll note that top has a -w option that reports:

-w      Display approximate swap usage for each process.

It can also sort the list by swap usage and
can show more processes (in case they are the
primary users of the swap space).

Something like:

top -CawSoswap

might show interesting information about what
is using swap space.



Going in another direction. . .

With only 1 GiByte of RAM and well over 7 GiByte of swap
(7808M), I wonder if your boot reports something like:

warning: total configured swap (??? pages) exceeds maximum recommended amount (??? pages).

If so, then quoting "man 8 loader" and its kern.maxswzone material,

Note that swap metadata can be fragmented, which means that
the system can run out of space before it reaches the
theoretical limit. Therefore, care should be taken to not
configure more swap than approximately half of the
theoretical maximum.

is what that warning is about: Looking at the swapon_check_swzone code the
warning is reporting the "half" figure as "recommended", not reporting the
theoretical maximum. So, translating: "care should be taken to not
configure more swap than" the reported maximum recommended amount. (If I
understand correctly.) [These notes are from an older message for a
different context.]
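
On a running system, a quick check for that warning (a sketch):

dmesg | grep -i "exceeds maximum recommended"

or look in /var/run/dmesg.boot if the message buffer has since wrapped.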

I'll note that an RPI2 V1.1 and a RPI3, both with 1 GiByte of RAM,
get very different recommended figures (the text is copied from
a 2017-Dec-06 message that was probably for running what at the
time was a somewhat old kernel, not a then-recent boot):

rpi2: . . . exceeds maximum recommended amount (411488 pages).
rpi3: . . . exceeds maximum recommended amount (925680 pages).

Pages are 4 KiBytes. Even the larger figure (RPI3) ends up
with only 925680*4KiBytes*(1MiByte/1024KiByte) == 3615.9375
MiByte recommended. In other words: far less than 7808M.

But your machine may well have a larger recommendation.


===
Mark Millard
marklmi at yahoo.com
( dsl-only.net went
away in early 2018-Mar)
Andriy Gapon
2018-03-17 18:26:58 UTC
Permalink
Post by Mark Millard
-w Display approximate swap usage for each process.
As far as I can tell, this option is quite broken.
The "approximate swap usage" it reports is nowhere like it.
--
Andriy Gapon
Mark Millard
2018-03-17 19:44:18 UTC
Permalink
Post by Andriy Gapon
Post by Mark Millard
-w Display approximate swap usage for each process.
As far as I can tell, this option is quite broken.
The "approximate swap usage" it reports is nowhere like it.
Too bad. Do you know if it is so messed up that the
apparent order of "uses more" vs. "uses less" would be
wrong when the difference in reported figures is fairly
large? (I'd avoid assuming an order for sufficiently
small differences [which still might be fairly large].)

Do you know if the system-wide figures from the summary
line:

Swap: 61G Total, 61G Free
(could also display an in-use figure)

are also broken as far as in-use would go? Should
top just be avoided for most swap-in-use information?

More overall, if anyone knows of such: Is there a
place to get reasonable swap-in-use information,
per process and/or system-wide?
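(For the system-wide side, swapinfo(8) does report an in-use figure
directly; for example, as a sketch:

swapinfo -h

with per-device usage and a Total line when several swap devices are
configured; pstat -s shows the same data.)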

One thing I've wished for is a lower bound
on the overall maximum-in-use figure (system wide), say
by periodically checking a reasonable in-use figure and
keeping track of (and reporting) the maximum-observed-so-far
figure. This kind of background information could be
used in choosing/adjusting a couple of poudriere-devel
parameters that control how much parallel activity
there can be.
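
A rough sketch of that sampling idea as a plain sh loop (assuming
swapinfo's "Used" column is the third field; with -k it is in KiB):

#!/bin/sh
# poll swap usage and report the maximum observed so far
max=0
while :; do
        used=$(swapinfo -k | awk 'END { print $3 }')
        [ "$used" -gt "$max" ] && max=$used
        printf 'swap in use: %sK (max observed: %sK)\n' "$used" "$max"
        sleep 60
done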

(My local top implementation has an adjustment to also
display such a system-wide maximum-observed-swap-used
figure.)

===
Mark Millard
marklmi at yahoo.com
( dsl-only.net went
away in early 2018-Mar)
O. Hartmann
2018-03-17 21:31:07 UTC
Permalink
On Sat, 17 Mar 2018 12:44:18 -0700
Mark Millard <marklmi26-***@yahoo.com> wrote:


Tried on the APU:


[...]

last pid: 17910;  load averages: 0.26, 0.16, 0.10    up 6+20:51:54  22:28:48
49 processes: 2 running, 46 sleeping, 1 waiting
CPU:  0.5% user,  0.0% nice,  0.4% system,  0.0% interrupt, 99.1% idle
Mem: 27M Active, 3772K Inact, 98M Laundry, 186M Wired, 32K Buf, 661M Free
Swap: 7808M Total, 4204K Used, 7804M Free

  PID USERNAME  THR PRI NICE   SIZE    RES   SWAP STATE   C   TIME     CPU COMMAND
17615 root        1  22    0 13204K     0K  4064K pause   1   0:00   0.00% -csh (<csh>)
17002 root        1  20    0 12104K     0K  3108K wait    3   0:00   0.00% login [pam] (<login>)
  975 root        1  20    0 14548K  1260K    88K nanslp  3   0:00   0.00% /usr/local/sbin/smartd -c /usr/local/etc/smartd.conf -p /var/run/smartd.pid
  989 root        1  20    0 32264K  8160K    28K nanslp  2   0:21   0.00% ddclient - sleeping for 1149 seconds (perl)
   11 root        4 155 ki31     0K    64K     0K CPU0    0 652.1H 396.04% [idle]
  997 asterisk   59  52    0   133M 62684K     0K select  0 297:57   2.68% /usr/local/sbin/asterisk -n -F -U asterisk
    0 root       26 -16    -     0K   416K     0K swapin  0  60:12   0.56% [kernel]
   12 root       14 -52    -     0K   224K     0K WAIT    0  20:39   0.40% [intr]
17618 root        1  20    0 13044K  3472K     0K CPU3    3   0:34   0.12% top -CawSoswap
  579 root        1  20    0 15252K  3116K     0K select  0  37:16   0.05% /usr/sbin/ppp -quiet -ddial -unit0 o2vdsl2
   19 root        1 -16    -     0K    16K     0K -       1   3:06   0.03% [rand_harvestq]
   21 root        3 -16    -     0K    48K     0K psleep  3   1:21   0.02% [pagedaemon]
  933 root        1  20    0 10892K  1688K     0K select  2   1:46   0.02% /usr/sbin/powerd

[...]

Sorry for the messy output ...
Post by Mark Millard
Post by Andriy Gapon
Post by Mark Millard
-w Display approximate swap usage for each process.
As far as I can tell, this option is quite broken.
The "approximate swap usage" it reports is nowhere like it.
Too bad. Do you know if it is so messed up that the
apparent order of "uses more" vs. "uses less" would be
wrong when the difference in reported figures is fairly
large? (I'd avoid assuming an order for sufficiently
small differences [which still might be fairly large].)
Do you know if the system-wide figures from the summary
Swap: 61G Total, 61G Free
(could also display an in-use figure)
are also broken as far as in-use would go? Should
top just be avoided for most swap-in-use information?
More overall, if anyone knows of such: Is there a
place to get reasonable swap-in-use information,
per process and/or system-wide?
One thing I've wished for is what would be a low bound
on the overall maximum-in-use figure (system wide), say
by checking periodically a reasonable in-use figure and
keeping track of (and reporting) the maximum-observed-so-far
figure. This kind of background information could be
used in choosing/adjusting a couple of poudriere-devel
parameters that control how much parallel activity
there can be.
(My local top implementation has an adjustment to also
display such a system-wide maximum-observed-swap-used
figure.)
===
Mark Millard
marklmi at yahoo.com
( dsl-only.net went
away in early 2018-Mar)
--
O. Hartmann

I object to the use or transmission of my data for advertising
purposes or for market or opinion research (§ 28 Abs. 4 BDSG).
Andriy Gapon
2018-03-17 21:39:41 UTC
Permalink
Post by Mark Millard
Do you know if the system-wide figures from the summary
Swap: 61G Total, 61G Free
(could also display an in-use figure)
are also broken as far as in-use would go? Should
top just be avoided for most swap-in-use information?
I don't think that I have said anything to put these numbers in doubt.
I specifically talked about the -w option.
--
Andriy Gapon
Mark Millard
2018-04-01 02:31:13 UTC
Permalink
Post by Andriy Gapon
Post by Mark Millard
-w Display approximate swap usage for each process.
As far as I can tell, this option is quite broken.
The "approximate swap usage" it reports is nowhere like it.
I have a hypothesis for part of what top is
counting in the process/thread SWAP column
that might not be what one would expect.

It appears to me that vnode-backed pages are
being re-classified sometimes for inactive
processes, and this classification leads to
top classifying the pages as not-resident
but swapped (in that a "VN PAGER in" would
be required, in systat -vmstat terms).


Supporting details, if you care, otherwise
skip the below:

The hypothesis comes from observations made at various points during
more than 20 hours of poudriere-devel "bulk -a" activity, for which,
at one such point:

vm.stats.vm.v_swappgsout: 0
vm.stats.vm.v_swappgsin: 0
vm.stats.vm.v_swapout: 0
vm.stats.vm.v_swapin: 0

vm.stats.vm.v_vnodepgsout: 6996
vm.stats.vm.v_vnodepgsin: 32641833
vm.stats.vm.v_vnodeout: 1030
vm.stats.vm.v_vnodein: 4305027
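
Those counters can also be watched live while reproducing; a minimal
sketch:

  sysctl vm.stats.vm | egrep 'swapp|vnodep'   # the raw counters
  systat -vmstat 2    # SWAP PAGER vs VN PAGER columns, updated live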

Sometimes top showed lots of wait/select/pause
and such with positive SWAP and (mostly) 0K RES.
Most of the processes were in the "wait" STATE.
At other times, more like between 1 and 2 dozen
had positive SWAP.

There would be sudden large jumps in the number
of such processes. Then over time it would
decrease as the processes quit waiting (children
process trees finished).

The large jumps were not tied to Free becoming
small or anything else obvious from what I was
looking at. But the Free figure would increase
at that time. For example, I recently saw such
a large jump that was associated with Free
increasing from "90G" as shown in top.
(Much of the time there were between, say, 170
and 300 sleeping processes.)

The context was under Hyper-V with 29 logical
processors assigned to FreeBSD on a machine
with 16 cores/32 threads and 114 GiBytes of
RAM assigned to FreeBSD (of 128 GiBytes) and
256 GiBytes of swap-partition set up.
PARALLEL_JOBS allowed the 29 and ALLOW_MAKE_JOBS
was in use (allowing a potential for 29*29=841
or so running processes via poudriere bulk).



For reference: at 25 hours-in [idle] had 148.3H
(around 20% of the 29 threads * 25 H/thread) and
[bufdaemon] had 48.9H (around 6.7%). [kernel]
showed around 13.6H (817 min converted) and
[pagedaemon] showed around 1.7H (101 min
converted). Other processes had less TIME than
any of these.

===
Mark Millard
marklmi at yahoo.com
( dsl-only.net went
away in early 2018-Mar)
Andriy Gapon
2018-04-04 17:16:58 UTC
Permalink
Post by Mark Millard
I have a hypothesis for part of what top is
counting in the process/thread SWAP column
that might not be what one would expect.
It appears to me that vnode-backed pages are
being re-classified sometimes for inactive
processes, and this classification leads to
top classifying the pages as not-resident
but swapped (in that a "VN PAGER in" would
be required, in systat -vmstat terms).
Not sure.
To me it seems that top just uses wrong statistics to calculate the value.

/* swap usage */
#define ki_swap(kip) \
((kip)->ki_swrss > (kip)->ki_rssize ? (kip)->ki_swrss - (kip)->ki_rssize : 0)

ki_rssize is the resident size of a process.
ki_swrss is resident set size before last swap.

Their difference is... exactly what?
I cannot even meaningfully describe this value.
But it is certainly _not_ the current swap utilization by the process.

Here is my attempt at obtaining a more reasonable approximation of the process
swap use. But it is still wildly inaccurate.

diff --git a/usr.bin/top/machine.c b/usr.bin/top/machine.c
index 2d97d7f867f36..361a1542e6e16 100644
--- a/usr.bin/top/machine.c
+++ b/usr.bin/top/machine.c
@@ -233,12 +233,13 @@ static int carc_enabled;
static int pageshift; /* log base 2 of the pagesize */

/* define pagetok in terms of pageshift */
-
-#define pagetok(size) ((size) << pageshift)
+#define pagetok(size) ((size) << (pageshift - LOG1024))
+#define btopage(size) ((size) >> pageshift)

/* swap usage */
#define ki_swap(kip) \
- ((kip)->ki_swrss > (kip)->ki_rssize ? (kip)->ki_swrss - (kip)->ki_rssize : 0)
+ (btopage((kip)->ki_size) > (kip)->ki_rssize ? \
+ btopage((kip)->ki_size) - (kip)->ki_rssize : 0)

/* useful externals */
long percentages(int cnt, int *out, long *new, long *old, long *diffs);
@@ -384,9 +385,6 @@ machine_init(struct statics *statics, char do_unames)
pagesize >>= 1;
}

- /* we only need the amount of log(2)1024 for our conversion */
- pageshift -= LOG1024;
-
/* fill in the statics information */
statics->procstate_names = procstatenames;
statics->cpustate_names = cpustatenames;
@@ -1374,7 +1372,7 @@ static int sorted_state[] = {
} while (0)

#define ORDERKEY_SWAP(a, b) do { \
- int diff = (int)ki_swap(b) - (int)ki_swap(a); \
+ int diff = (long)ki_swap(b) - (long)ki_swap(a); \
if (diff != 0) \
return (diff > 0 ? 1 : -1); \
} while (0)
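
For anyone wanting to try this, a sketch (it assumes a stock /usr/src
tree matching the running system; ~/top-swap.diff is a made-up name
for the patch above):

  cd /usr/src
  patch -p1 < ~/top-swap.diff
  make -C usr.bin/top all install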
--
Andriy Gapon
Mark Millard
2018-04-05 02:03:43 UTC
Permalink
Post by Andriy Gapon
Post by Mark Millard
I have a hypothesis for part of what top is
counting in the process/thread SWAP column
that might not be what one would expect.
It appears to me that vnode-backed pages are
being re-classified sometimes for inactive
processes, and this classification leads to
top classifying the pages as not-resident
but swapped (in that a "VN PAGER in" would
be required, in systat -vmstat terms).
Not sure.
To me it seems that top just uses wrong statistics to calculate the value.
/* swap usage */
#define ki_swap(kip) \
((kip)->ki_swrss > (kip)->ki_rssize ? (kip)->ki_swrss - (kip)->ki_rssize : 0)
ki_rssize is the resident size of a process.
ki_swrss is resident set size before last swap.
Their difference is... exactly what?
I cannot even meaningfully describe this value.
But it is certainly _not_ the current swap utilization by the process.
Here is my attempt at obtaining a more reasonable approximation of the process
swap use. But it is still wildly inaccurate.
. . .
If I get time this weekend, I'll try the patch. Thanks.

I've classically seen things like (picking on java here):
(no patch yet, so SWAP 0K shows)

PID USERNAME THR PRI NICE SIZE RES SWAP STATE C TIME CPU COMMAND
78694 root 44 52 0 14779M 92720K 0K uwait 22 0:06 9.91% [java]

when Swap: . . . 0 Used . . . (or some figure much
smaller than SIZE-RES) showed. (SIZE is ki_size and
RES is ki_rssize as I remember.) It suggests some
form of reserved-but-not-allocated contribution to
ki_size (SIZE), space not resident nor swapped out
to a swap partition. Possibly vnode-backed (potential
"VN PAGER in and out" contributions instead of "SWAP
PAGER" ones, in systat -vmstat terms)?

Are such cases examples of what you were counting
as "wildly inaccurate"? Or do you count vnode-backed
but not resident as perfectly good examples of SWAP
in use?

===
Mark Millard
marklmi at yahoo.com
( dsl-only.net went
away in early 2018-Mar)
Andriy Gapon
2018-04-05 09:03:21 UTC
Permalink
Post by Mark Millard
(no patch yet, so SWAP 0K shows)
PID USERNAME THR PRI NICE SIZE RES SWAP STATE C TIME CPU COMMAND
78694 root 44 52 0 14779M 92720K 0K uwait 22 0:06 9.91% [java]
when Swap: . . . 0 Used . . . (or some figure much
smaller than SIZE-RES) showed. (SIZE is ki_size and
RES is ki_rssize as I remember.) It suggests some
form of reserved-but-not-allocated contribution to
ki_size (SIZE), space not resident nor swapped out
to a swap partition. Possibly vnode-backed (potential
"VN PAGER in and out" contributions instead of "SWAP
PAGER" ones, in systat -vmstat terms)?
Are such cases examples of what you were counting
as "wildly inaccurate"? Or do you count vnode-backed
but not resident as perfectly good examples of SWAP
in use?
Apologies, but I didn't quite get your question...

Instead, I think I now understand better what top actually reports.
I think that it reports swap usage based on the archaic swapout algorithm where
a whole process is moved to swap when a memory shortage arises. And then the
process can be gradually swapped in. So, ki_swrss - ki_rssize is an estimate of
how much of the process's memory is still left in swap (the difference between the
resident size before the full swapout and the current resident size).
Of course, now we have the modern swapout where individual inactive dirty
anonymous pages can be paged out to swap (the classic whole-process swapout
still can happen, but is quite rare), but top is not aware of that.
--
Andriy Gapon
Mark Millard
2018-03-19 00:54:30 UTC
Permalink
bob prohaska fbsd at www.zefox.net wrote on
Post by bob prohaska
Post by Jeff Roberson
Also, if you could try going back to r328953 or r326346 and let me know if
the problem exists in either. That would be very helpful. If anyone is
Not sure this is relevant, but r326343 is able to run a j4 buildworld
to completion on an RPi3 with 3 gigs of microSD-based swap. There are
periods of seeming distress at times (lots of swread, pfault state
in top along with high %idle) in top, but the compilation completes.
In contrast, r328953 could not complete buildworld using -j4. Buildworld
would stop, usually reporting c++ killed, apparently for want of swap,
even though swap usage never exceeded about 30% according to top.
The machine employs UFS filesystems, . . .
Sounds like -r326346 would be an interesting kernel to test (the
next check-in on head after -r326343, one of Jeff's check-ins).

-r328953 was just before Jeff's:

Author: jeff
Date: Tue Feb 6 22:10:07 2018
New Revision: 328954
URL:
https://svnweb.freebsd.org/changeset/base/328954


Log:
Use per-domain locks for vm page queue free. Move paging control from
global to per-domain state. Protect reservations with the free lock
from the domain that they belong to. Refactor to make vm domains more
of a first class object.



So, if -r328953 behaves oddly and -r326343 does not, then the
question is if -r326346 makes the difference:

Author: jeff
Date: Tue Nov 28 23:18:35 2017
New Revision: 326346
URL:
https://svnweb.freebsd.org/changeset/base/326346


Log:
Move domain iterators into the page layer where domain selection should take
place. This makes the majority of the phys layer explicitly domain specific.


(Unfortunately my FreeBSD time is currently greatly limited.)

It is also interesting that your test context is UFS. O. Hartmann
has reported problems for UFS in a more modern version: -r330608 .


===
Mark Millard
marklmi at yahoo.com
( dsl-only.net went
away in early 2018-Mar)
bob prohaska
2018-03-19 03:50:33 UTC
Permalink
Post by Mark Millard
bob prohaska fbsd at www.zefox.net wrote on
Post by bob prohaska
Post by Jeff Roberson
Also, if you could try going back to r328953 or r326346 and let me know if
the problem exists in either. That would be very helpful. If anyone is
Not sure this is relevant, but r326343 is able to run a j4 buildworld
to completion on an RPi3 with 3 gigs of microSD-based swap. There are
periods of seeming distress at times (lots of swread, pfault state
in top along with high %idle) in top, but the compilation completes.
In contrast, r328953 could not complete buildworld using -j4. Buildworld
would stop, usually reporting c++ killed, apparently for want of swap,
even though swap usage never exceeded about 30% according to top.
The machine employs UFS filesystems, . . .
Sounds like -r326346 would be an interesting kernel to test (the
next check-in on head after -r326343, one of Jeff's check-ins).
My intent was to try 326346. Somehow 326343 arrived in its place.
Do architectures affect revision numbers? I thought not, but...
Post by Mark Millard
Author: jeff
Date: Tue Feb 6 22:10:07 2018
New Revision: 328954
https://svnweb.freebsd.org/changeset/base/328954
Use per-domain locks for vm page queue free. Move paging control from
global to per-domain state. Protect reservations with the free lock
from the domain that they belong to. Refactor to make vm domains more
of a first class object.
So, if -r328953 behaves oddly and -r326343 does not, then the
Author: jeff
Date: Tue Nov 28 23:18:35 2017
New Revision: 326346
https://svnweb.freebsd.org/changeset/base/326346
Move domain iterators into the page layer where domain selection should take
place. This makes the majority of the phys layer explicitly domain specific.
(Unfortunately my FreeBSD time is currently greatly limited.)
It is also interesting that your test context is UFS. O. Hartmann
has reported problems for UFS in a more modern version: -r330608 .
When "out of swap" problems appeared I cobbled up a custom kernel,
in the hope that a smaller kernel might help. It has since developed
that the custom kernel can't boot, but GENERIC still boots. The system
is now running a j4 buildworld on r331153 with a GENERIC kernel
to see if maybe the problem has already been fixed in a way that
was obscured by the customization. The kernel config is at
http://www.zefox.net/~fbsd/rpi3/kernel_config/ZEFOX

Thanks for reading,

bob prohaska
bob prohaska
2018-03-19 16:57:01 UTC
Permalink
Post by bob prohaska
Post by Mark Millard
bob prohaska fbsd at www.zefox.net wrote on
Post by bob prohaska
Post by Jeff Roberson
Also, if you could try going back to r328953 or r326346 and let me know if
the problem exists in either. That would be very helpful. If anyone is
Not sure this is relevant, but r326343 is able to run a j4 buildworld
to completion on an RPi3 with 3 gigs of microSD-based swap. There are
periods of seeming distress at times (lots of swread, pfault state
in top along with high %idle) in top, but the compilation completes.
In contrast, r328953 could not complete buildworld using -j4. Buildworld
would stop, usually reporting c++ killed, apparently for want of swap,
even though swap usage never exceeded about 30% according to top.
The machine employs UFS filesystems, . . .
Sounds like -r326346 would be an interesting kernel to test (the
next check-in on head after -r326343, one of Jeff's check-ins).
My intent was to try 326346. Somehow 326343 arrived in its place.
Do architectures affect revision numbers? I thought not, but...
When "out of swap" problems appeared I cobbled up a custom kernel,
in the hope that a smaller kernel might help. It has since developed
that the custom kernel can't boot, but GENERIC still boots. The system
is now running a j4 buildworld on r331153 with a GENERIC kernel
The -j4 buildworld using a GENERIC kernel crashed at the 1.6 MB point in the
logfile, which is the usual place for trouble. The debris collected is at

http://www.zefox.net/~fbsd/rpi3/crashes/20180319/

Being on an RPi3, I'm still uncertain whether the problems seen are connected
to the original subject. If they are not, please inform me.

Thanks for reading,

bob prohaska
Mark Millard
2018-04-01 20:01:20 UTC
Permalink
Andriy Gapon avg at FreeBSD.org wrote on
Post by Andriy Gapon
First, it looks like maybe several different issues are being discussed and
possibly conflated in this thread.
It looks like one of the issues that contributes to non-ZFS contexts
seeing process-kills for out-of-swap when there is plenty left has
been identified in the code for head. See:

https://lists.freebsd.org/pipermail/svn-src-head/2018-April/111629.html

It looks like Mark Johnston will be checking something in to help
but there is more to do after that for what was identified.


Overall about the reporting:

Despite lots of hypothesizing beyond the evidence presented, I'm glad for
the reports of issues from multiple types of environments, and to know the
types of contexts in which multiple folks have seen somewhat similar problems.
Separating issues out can be difficult and time consuming. Knowing what
all needs eventual coverage helps guide things as progress is made. This
is true even for those that are just looking to pick up the eventual
fix(es) for the problem(s) they happen to have run into.

===
Mark Millard
marklmi26-fbsd at yahoo.com
( dsl-only.net went
away in early 2018-Mar)
Cy Schubert
2018-04-04 01:24:04 UTC
Permalink
+1

However, under certain circumstances it will release some memory. To reproduce: when bsdtar unpacks some tarballs (while building certain ports), tar will use 12 GB or more, forcing ARC to release memory.

BTW, I haven't stopped to grok whether the bsdtar issue is local to me or another problem yet.

---
Sent using a tiny phone keyboard.
Apologies for any typos and autocorrect.
Also, this old phone only supports top post. Apologies.

Cy Schubert
<***@cschubert.com> or <***@freebsd.org>
The need of the many outweighs the greed of the few.
---

-----Original Message-----
From: Bryan Drewery
Sent: 23/03/2018 17:23
To: Peter Jeremy; Jeff Roberson
Cc: FreeBSD current; Andriy Gapon
Subject: Re: Strange ARC/Swap/CPU on yesterday's -CURRENT
Post by Peter Jeremy
Post by Jeff Roberson
Also, if you could try going back to r328953 or r326346 and let me know if
the problem exists in either. That would be very helpful. If anyone is
willing to debug this with me contact me directly and I will send some
test patches or debugging info after you have done the above steps.
I ran into this on 11-stable and tracked it to r326619 (MFC of r325851).
I initially got around the problem by reverting that commit but either
it or something very similar is still present in 11-stable r331053.
I've seen it in my main server (32GB RAM) but haven't managed to reproduce
it in smaller VBox guests - one difficulty I faced was artificially filling
ARC.
Looking at the ARC change you referred to from r325851
https://reviews.freebsd.org/D12163, I am convinced that ARC backpressure
is completely broken. On my 78GB RAM system with ARC limited to 40GB and
doing a poudriere build of all LLVM and GCC packages at once in tmpfs I
can get swap up near 50GB and yet the ARC remains at 40GB through it
all. It's always been slow to give up memory for package builds but it
really seems broken right now.
--
Regards,
Bryan Drewery
Larry Rosenman
2018-04-04 01:27:06 UTC
Permalink
When my full backups run (1st Sunday -> Monday of the month) the box becomes unusable after 5-10 hours of that backup, with LOTS of SWAP usage
and ARC using 100+G.

Is anyone looking into this?
--
Larry Rosenman http://www.lerctr.org/~ler
Phone: +1 214-642-9640 E-Mail: ***@lerctr.org
US Mail: 5708 Sabbia Drive, Round Rock, TX 78665-2106
On 4/3/18, 8:24 PM, "Cy Schubert" <owner-freebsd-***@freebsd.org on behalf of ***@cschubert.com> wrote:

[...]
Cy Schubert
2018-04-04 01:47:21 UTC
Permalink
Agreed. I've come to the same conclusion that there may be multiple issues.

---
Sent using a tiny phone keyboard.
Apologies for any typos and autocorrect.
Also, this old phone only supports top post. Apologies.

Cy Schubert
<***@cschubert.com> or <***@freebsd.org>
The need of the many outweighs the greed of the few.
---

-----Original Message-----
From: Andriy Gapon
Sent: 27/03/2018 08:02
To: Bryan Drewery; Peter Jeremy; Jeff Roberson
Cc: FreeBSD current
Subject: Re: Strange ARC/Swap/CPU on yesterday's -CURRENT
Post by Bryan Drewery
Post by Peter Jeremy
Post by Jeff Roberson
Also, if you could try going back to r328953 or r326346 and let me know if
the problem exists in either. That would be very helpful. If anyone is
willing to debug this with me contact me directly and I will send some
test patches or debugging info after you have done the above steps.
I ran into this on 11-stable and tracked it to r326619 (MFC of r325851).
I initially got around the problem by reverting that commit but either
it or something very similar is still present in 11-stable r331053.
I've seen it in my main server (32GB RAM) but haven't managed to reproduce
it in smaller VBox guests - one difficulty I faced was artificially filling
ARC.
First, it looks like maybe several different issues are being discussed and
possibly conflated in this thread. I see reports related to ZFS, I see reports
where ZFS is not used at all. Some people report problems that appeared very
recently while others chime in with "yes, yes, I've always had this problem".
This does not help to differentiate between problems and to analyze them.
Post by Bryan Drewery
Looking at the ARC change you referred to from r325851
https://reviews.freebsd.org/D12163, I am convinced that ARC backpressure
is completely broken.
Does your being convinced come from the code review or from experiments?
If the former, could you please share your analysis?
Post by Bryan Drewery
On my 78GB RAM system with ARC limited to 40GB and
doing a poudriere build of all LLVM and GCC packages at once in tmpfs I
can get swap up near 50GB and yet the ARC remains at 40GB through it
all. It's always been slow to give up memory for package builds but it
really seems broken right now.
Well, there are multiple angles. Maybe it's ARC that does not react properly,
or maybe it's not being asked properly.

It would be useful to monitor the system during its transition to the state that
you reported. There are some interesting DTrace probes in ARC; specifically,
arc-available_memory and arc-needfree are the first that come to mind. Their
parameters and how frequently they are called are of interest. VM parameters
could be of interest as well.
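
A minimal starting point for that monitoring (a sketch; the probe
names are taken from above, but the argument meanings here are
assumptions, so check the DTRACE_PROBE sites in arc.c first):

  dtrace -n 'sdt:::arc-available_memory { @avail = quantize(arg0); }' \
         -n 'sdt:::arc-needfree { @needfree = quantize(arg0); }'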

A rant.

Basically, posting some numbers and jumping to conclusions does not help at all.
Monitoring, graphing, etc. does help.
ARC is a complex dynamic system.
VM (pagedaemon, UMA caches) is a complex dynamic system.
They interact in complex dynamic ways.
Sometimes a change in ARC is incorrect and requires a fix.
Sometimes a change in VM is incorrect and requires a fix.
Sometimes a change in VM requires a change in ARC.
These three kinds of problems can all appear as a "problem with ARC".

For instance, when vm.lowmem_period was introduced you wouldn't find any mention
of ZFS/ARC. But it does affect the period between arc_lowmem() calls.

Also, pin-pointing a specific commit requires proper bisecting and proper
testing to correctly attribute systemic behavior changes to code changes.
--
Andriy Gapon
_______________________________________________
freebsd-***@freebsd.org mailing list
https://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to "freebsd-current-***@freebsd.org"
Cy Schubert
2018-04-04 01:53:46 UTC
Permalink
Try arbitrarily reducing arc_max through sysctl. ARC is immediately reduced and free memory increases; however, wired pages remain the same.

---
Sent using a tiny phone keyboard.
Apologies for any typos and autocorrect.
Also, this old phone only supports top post. Apologies.

Cy Schubert
<***@cschubert.com> or <***@freebsd.org>
The need of the many outweighs the greed of the few.
---

-----Original Message-----
From: Larry Rosenman
Sent: 03/04/2018 19:27
To: Cy Schubert; Bryan Drewery; Peter Jeremy; Jeff Roberson
Cc: FreeBSD current; Andriy Gapon
Subject: Re: Strange ARC/Swap/CPU on yesterday's -CURRENT

[...]
Larry Rosenman
2018-04-04 02:46:21 UTC
Permalink
Thanks for that hint. I thought that arc_max was a tunable, but now knowing it's a sysctl, that helps a lot.
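
For reference, it is both a loader tunable and a writable sysctl; a
sketch of the two forms, with an arbitrary 16 GiB value:

  sysctl vfs.zfs.arc_max=17179869184   # change at runtime
  echo 'vfs.zfs.arc_max="17179869184"' >> /boot/loader.conf   # set at boot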
--
Larry Rosenman http://www.lerctr.org/~ler
Phone: +1 214-642-9640 E-Mail: ***@lerctr.org
US Mail: 5708 Sabbia Drive, Round Rock, TX 78665-2106
On 4/3/18, 8:54 PM, "Cy Schubert" <owner-freebsd-***@freebsd.org on behalf of ***@cschubert.com> wrote:

[...]
Don Lewis
2018-04-04 04:45:13 UTC
Permalink
Post by Cy Schubert
Try arbitrarily reducing arc_max through sysctl. ARC is immediately
reduced and free memory increases; however, wired pages remain the
same.
One thing that I've noticed is that with r329844 and earlier
there can be a difference of multiple GB between the ARC size and the
amount of wired memory. With r329904 and newer the difference seems to
be pretty steady at 1 GB.
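
A quick sketch for watching that gap (it assumes 4 KiB pages and the
usual arcstats sysctl name):

  arc=$(sysctl -n kstat.zfs.misc.arcstats.size)
  wired=$(( $(sysctl -n vm.stats.vm.v_wire_count) * 4096 ))
  echo "wired - ARC = $(( (wired - arc) / 1048576 )) MiB"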