Discussion:
crash on process exit.. current at about r332467
Bryan Drewery
2018-04-25 19:41:37 UTC
Permalink
back trace at:  http://www.freebsd.org/~julian/bob-crash.png
If anyone wants to take a look..
In the exit syscall, while deallocating a vm object.
I haven't seen references to a similar crash in the last 10 days or so..
But if it rings any bells...
I just hit this on r332455 and have a dump.
panic: Bad link elm 0xfffff811cd920e60 prev->next != elm
cpuid = 10
time = 1524682939
db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe23f450c3b0
vpanic() at vpanic+0x18d/frame 0xfffffe23f450c410
panic() at panic+0x43/frame 0xfffffe23f450c470
vm_object_destroy() at vm_object_destroy/frame 0xfffffe23f450c4d0
vm_object_deallocate() at vm_object_deallocate+0x45c/frame 0xfffffe23f450c530
vm_map_process_deferred() at vm_map_process_deferred+0x99/frame 0xfffffe23f450c560
vm_map_remove() at vm_map_remove+0xc6/frame 0xfffffe23f450c590
exec_new_vmspace() at exec_new_vmspace+0x185/frame 0xfffffe23f450c5f0
exec_elf64_imgact() at exec_elf64_imgact+0x8fb/frame 0xfffffe23f450c6e0
kern_execve() at kern_execve+0x82c/frame 0xfffffe23f450ca40
sys_execve() at sys_execve+0x4c/frame 0xfffffe23f450cac0
amd64_syscall() at amd64_syscall+0x786/frame 0xfffffe23f450cbf0
fast_syscall_common() at fast_syscall_common+0x101/frame 0xfffffe23f450cbf0
--- syscall (59, FreeBSD ELF64, sys_execve), rip = 0x800d7af7a, rsp = 0x7fffffffbd28, rbp = 0x7fffffffbe70 ---
It's a different stack than Julian's but it seems like the same issue to me.
--
Regards,
Bryan Drewery
Bryan Drewery
2018-04-25 19:39:47 UTC
Permalink
back trace at:  http://www.freebsd.org/~julian/bob-crash.png
If anyone wants to take a look..
In the exit syscall, while deallocating a vm object.
I haven't seen references to a similar crash in the last 10 days or so..
But if it rings any bells...
I just hit this on r332455 and have a dump.
panic: Bad link elm 0xfffff811cd920e60 prev->next != elm
cpuid = 10
time = 1524682939
db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe23f450c3b0
vpanic() at vpanic+0x18d/frame 0xfffffe23f450c410
panic() at panic+0x43/frame 0xfffffe23f450c470
vm_object_destroy() at vm_object_destroy/frame 0xfffffe23f450c4d0
vm_object_deallocate() at vm_object_deallocate+0x45c/frame 0xfffffe23f450c530
vm_map_process_deferred() at vm_map_process_deferred+0x99/frame 0xfffffe23f450c560
vm_map_remove() at vm_map_remove+0xc6/frame 0xfffffe23f450c590
exec_new_vmspace() at exec_new_vmspace+0x185/frame 0xfffffe23f450c5f0
exec_elf64_imgact() at exec_elf64_imgact+0x8fb/frame 0xfffffe23f450c6e0
kern_execve() at kern_execve+0x82c/frame 0xfffffe23f450ca40
sys_execve() at sys_execve+0x4c/frame 0xfffffe23f450cac0
amd64_syscall() at amd64_syscall+0x786/frame 0xfffffe23f450cbf0
fast_syscall_common() at fast_syscall_common+0x101/frame 0xfffffe23f450cbf0
--- syscall (59, FreeBSD ELF64, sys_execve), rip = 0x800d7af7a, rsp = 0x7fffffffbd28, rbp = 0x7fffffffbe70 ---
--
Regards,
Bryan Drewery
Bryan Drewery
2018-04-25 20:35:20 UTC
Permalink
Post by Bryan Drewery
back trace at:  http://www.freebsd.org/~julian/bob-crash.png
If anyone wants to take a look..
In the exit syscall, while deallocating a vm object.
I haven't seen references to a similar crash in the last 10 days or so..
But if it rings any bells...
I just hit this on r332455 and have a dump.
panic: Bad link elm 0xfffff811cd920e60 prev->next != elm
cpuid = 10
time = 1524682939
db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe23f450c3b0
vpanic() at vpanic+0x18d/frame 0xfffffe23f450c410
panic() at panic+0x43/frame 0xfffffe23f450c470
vm_object_destroy() at vm_object_destroy/frame 0xfffffe23f450c4d0
vm_object_deallocate() at vm_object_deallocate+0x45c/frame 0xfffffe23f450c530
vm_map_process_deferred() at vm_map_process_deferred+0x99/frame 0xfffffe23f450c560
vm_map_remove() at vm_map_remove+0xc6/frame 0xfffffe23f450c590
exec_new_vmspace() at exec_new_vmspace+0x185/frame 0xfffffe23f450c5f0
exec_elf64_imgact() at exec_elf64_imgact+0x8fb/frame 0xfffffe23f450c6e0
kern_execve() at kern_execve+0x82c/frame 0xfffffe23f450ca40
sys_execve() at sys_execve+0x4c/frame 0xfffffe23f450cac0
amd64_syscall() at amd64_syscall+0x786/frame 0xfffffe23f450cbf0
fast_syscall_common() at fast_syscall_common+0x101/frame 0xfffffe23f450cbf0
--- syscall (59, FreeBSD ELF64, sys_execve), rip = 0x800d7af7a, rsp = 0x7fffffffbd28, rbp = 0x7fffffffbe70 ---
It's a different stack than Julian's but it seems like the same issue to me.
#12 0xffffffff80b6d1c3 in panic (fmt=0xffffffff81dee958 <cnputs_mtx> "\322\237'\201\377\377\377\377") at /usr/src/sys/kern/kern_shutdown.c:764
#13 0xffffffff80eac060 in vm_object_terminate (object=0xfffff80587c4de00) at /usr/src/sys/vm/vm_object.c:868
#14 0xffffffff80eaaf2c in vm_object_deallocate (object=0xfffff80587c4de00) at /usr/src/sys/vm/vm_object.c:684
#15 0xffffffff80ea0089 in vm_map_entry_deallocate (system_map=<error reading variable: Cannot access memory at address 0x0>, entry=<optimized out>) at /usr/src/sys/vm/vm_map.c:2997
#16 vm_map_process_deferred () at /usr/src/sys/vm/vm_map.c:541
#17 0xffffffff80ea5186 in _vm_map_unlock (map=<optimized out>, file=<optimized out>, line=3189) at /usr/src/sys/vm/vm_map.c:554
#18 vm_map_remove (map=<optimized out>, start=4096, end=140737488355328) at /usr/src/sys/vm/vm_map.c:3189
#19 0xffffffff80b24c35 in exec_new_vmspace (imgp=0xfffffe23f450c8b0, sv=<optimized out>) at /usr/src/sys/kern/kern_exec.c:1099
#20 0xffffffff80afaf1b in exec_elf64_imgact (imgp=<optimized out>) at /usr/src/sys/kern/imgact_elf.c:922
#21 0xffffffff80b2380c in do_execve (td=<optimized out>, args=<optimized out>, mac_p=<optimized out>) at /usr/src/sys/kern/kern_exec.c:606
#22 kern_execve (td=<optimized out>, args=<optimized out>, mac_p=<optimized out>) at /usr/src/sys/kern/kern_exec.c:351
#23 0xffffffff80b22c9c in sys_execve (td=0xfffff801493c6000, uap=0xfffff801493c63c0) at /usr/src/sys/kern/kern_exec.c:225
(kgdb) frame 13
#13 0xffffffff80eac060 in vm_object_terminate (object=0xfffff80587c4de00) at /usr/src/sys/vm/vm_object.c:868
868 KASSERT(object->cred == NULL || object->type == OBJT_DEFAULT ||
(kgdb) p *object
$2 = {lock = {lock_object = {lo_name = 0xffffffff81214cff "vm object", lo_flags = 627245056, lo_data = 0, lo_witness = 0xfffff8123fd6a700}, rw_lock = 18446735283140190208}, object_list = {
tqe_next = 0xfffff80587c67000, tqe_prev = 0xfffff80587c4dd20}, shadow_head = {lh_first = 0x0}, shadow_list = {le_next = 0xfffff80a9606c500, le_prev = 0xfffff8054f0c2c30}, memq = {
tqh_first = 0xfffff811d00c9980, tqh_last = 0xfffff811d850b8a0}, rtree = {rt_root = 18446735322516082688}, size = 2048, domain = {dr_policy = 0x0, dr_iterator = 0}, generation = 1, ref_count = 0,
shadow_count = 0, memattr = 6 '\006', type = 0 '\000', flags = 12296, pg_color = 1024, paging_in_progress = 0, resident_page_count = 5, backing_object = 0x0, backing_object_offset = 0,
pager_object_list = {tqe_next = 0x0, tqe_prev = 0x0}, rvq = {lh_first = 0xfffff811cbac2240}, handle = 0x0, un_pager = {vnp = {vnp_size = 0, writemappings = 0}, devp = {devp_pglist = {
tqh_first = 0x0, tqh_last = 0x0}, ops = 0x0, dev = 0x0}, sgp = {sgp_pglist = {tqh_first = 0x0, tqh_last = 0x0}}, swp = {swp_tmpfs = 0x0, swp_blks = {pt_root = 0}}}, cred = 0xfffff807ebd91500,
charge = 8388608, umtx_data = 0x0}
object->type = OBJT_DEFAULT.

Er, I'm not sure what's going on here, as line 868 is a totally different
assert than the queue(3) one in the msgbuf...
Post by Bryan Drewery
868 KASSERT(object->cred == NULL || object->type == OBJT_DEFAULT ||
869 object->type == OBJT_SWAP,
870 ("%s: non-swap obj %p has cred", __func__, object));
--
Regards,
Bryan Drewery
Andriy Gapon
2018-05-29 13:50:14 UTC
Permalink
back trace at:  http://www.freebsd.org/~julian/bob-crash.png
If anyone wants to take a look..
In the exit syscall, while deallocating a vm object.
I haven't seen references to a similar crash in the last 10 days or so.. But if
it rings any bells...
We have just got another one:
panic: Bad link elm 0xfffff80cc3938360 prev->next != elm

Matching disassembled code to C code, it seems that the crash is somewhere in
vm_object_terminate_pages (inlined into vm_object_terminate), probably in one of
TAILQ_REMOVE-s there:
	if (p->queue != PQ_NONE) {
		KASSERT(p->queue < PQ_COUNT, ("vm_object_terminate: "
		    "page %p is not queued", p));
		pq1 = vm_page_pagequeue(p);
		if (pq != pq1) {
			if (pq != NULL) {
				vm_pagequeue_cnt_add(pq, dequeued);
				vm_pagequeue_unlock(pq);
			}
			pq = pq1;
			vm_pagequeue_lock(pq);
			dequeued = 0;
		}
		p->queue = PQ_NONE;
		TAILQ_REMOVE(&pq->pq_pl, p, plinks.q);
		dequeued--;
	}
	if (vm_page_free_prep(p, true))
		continue;
unlist:
	TAILQ_REMOVE(&object->memq, p, listq);
}


Please note that this is the code before r332974 ("Improve VM page queue scalability").
I am not sure whether r332974 + r333256 would fix the problem or whether it would just get
moved to a different place.

Does this ring a bell to anyone who tinkered with that part of the VM code recently?

Looking a little bit further, I think that object->memq somehow got corrupted.
memq contains just two elements and the reported element is not there.

(kgdb) p *(struct vm_page *)0xfffff80cc3938360
$22 = {
plinks = {
q = {
tqe_next = 0xfffff80cd7175398,
tqe_prev = 0xfffff80cb9f69170
},
s = {
ss = {
sle_next = 0xfffff80cd7175398
},
pv = 0xfffff80cb9f69170
},
memguard = {
p = 18446735332764767128,
v = 18446735332276081008
}
},
listq = {
tqe_next = 0xfffff80cc3938568, <=============
tqe_prev = 0xfffff8078c11b848 <=============
},
object = 0x0,
pindex = 1548,
phys_addr = 14695911424,
md = {
pv_list = {
tqh_first = 0x0,
tqh_last = 0xfffff80cc3938398
},
pv_gen = 1205766,
pat_mode = 6
},
wire_count = 0,
busy_lock = 1,
hold_count = 0,
flags = 0,
aflags = 0 '\000',
oflags = 0 '\000',
queue = 255 '\377',
psind = 0 '\000',
segind = 5 '\005',
order = 13 '\r',
pool = 0 '\000',
act_count = 5 '\005',
valid = 0 '\000',
dirty = 0 '\000'
}

(kgdb) p object->memq
$11 = {
tqh_first = 0xfffff80cb861cfb8,
tqh_last = 0xfffff80cc3938780
}

(kgdb) p *object->memq.tqh_first
$25 = {
plinks = {
q = {
tqe_next = 0xfffff80cb9f69108,
tqe_prev = 0xfffff80cd7175398
},
s = {
ss = {
sle_next = 0xfffff80cb9f69108
},
pv = 0xfffff80cd7175398
},
memguard = {
p = 18446735332276080904,
v = 18446735332764767128
}
},
listq = {
tqe_next = 0xfffff80cb56eafb0, <=============
tqe_prev = 0xfffff8078c11b848 <=============
},
object = 0xfffff8078c11b800,
pindex = 515,
phys_addr = 7299219456,
md = {
pv_list = {
tqh_first = 0xfffff80b99e4ff88,
tqh_last = 0xfffff80b99e4ff90
},
pv_gen = 466177,
pat_mode = 6
},
wire_count = 0,
busy_lock = 2,
hold_count = 0,
flags = 0,
aflags = 0 '\000',
oflags = 0 '\000',
queue = 255 '\377',
psind = 0 '\000',
segind = 5 '\005',
order = 13 '\r',
pool = 0 '\000',
act_count = 5 '\005',
valid = 255 '\377',
dirty = 0 '\000'
}
(kgdb) p *object->memq.tqh_first->listq.tqe_next
$26 = {
plinks = {
q = {
tqe_next = 0x0,
tqe_prev = 0xfffff80cc92e1d18
},
s = {
ss = {
sle_next = 0x0
},
pv = 0xfffff80cc92e1d18
},
memguard = {
p = 0,
v = 18446735332531379480
}
},
listq = {
tqe_next = 0x0, <=============
tqe_prev = 0xfffff80cb861cfc8 <=============
},
object = 0xfffff8078c11b800,
pindex = 1548,
phys_addr = 5350158336,
md = {
pv_list = {
tqh_first = 0xfffff80a07222808,
tqh_last = 0xfffff80a07222810
},
pv_gen = 7085,
pat_mode = 6
},
wire_count = 0,
busy_lock = 1,
hold_count = 0,
flags = 0,
aflags = 1 '\001',
oflags = 0 '\000',
queue = 1 '\001',
psind = 0 '\000',
segind = 5 '\005',
order = 13 '\r',
pool = 0 '\000',
act_count = 5 '\005',
valid = 255 '\377',
dirty = 255 '\377'
}

Pages 0xfffff80cc3938360 (the reported one) and 0xfffff80cb56eafb0 (the last one
on memq) have the same index 1548. Also, memq.tqh_last points to the reported
page, but it is not reachable via tqe_next pointers.
It's also potentially interesting that the reported page looks like it's
already freed, while the replacement page is both valid and dirty.

The object, just in case:
(kgdb) p *object
$34 = {
lock = {
lock_object = {
lo_name = 0xffffffff81202c27 "vm object",
lo_flags = 627245056,
lo_data = 0,
lo_witness = 0xfffff80cffd6a700
},
rw_lock = 18446735286009226592
},
object_list = {
tqe_next = 0xfffff80b2481e200,
tqe_prev = 0xfffff80b2481e020
},
shadow_head = {
lh_first = 0x0
},
shadow_list = {
le_next = 0xfffff809c070f900,
le_prev = 0xfffff80869c06c30
},
memq = {
tqh_first = 0xfffff80cb861cfb8,
tqh_last = 0xfffff80cc3938780
},
rtree = {
rt_root = 18446735279843613792
},
size = 1561,
domain = {
dr_policy = 0x0,
dr_iterator = 0
},
generation = 1,
ref_count = 0,
shadow_count = 0,
memattr = 6 '\006',
type = 0 '\000',
flags = 12296,
pg_color = 1809,
paging_in_progress = 0,
resident_page_count = 5,
backing_object = 0x0,
backing_object_offset = 0,
pager_object_list = {
tqe_next = 0x0,
tqe_prev = 0x0
},
rvq = {
lh_first = 0xfffff80cad278b60
},
handle = 0x0,
un_pager = {
vnp = {
vnp_size = 19444,
writemappings = 0
},
devp = {
devp_pglist = {
tqh_first = 0x4bf4,
tqh_last = 0x0
},
ops = 0x0,
dev = 0x0
},
sgp = {
sgp_pglist = {
tqh_first = 0x4bf4,
tqh_last = 0x0
}
},
swp = {
swp_tmpfs = 0x4bf4,
swp_blks = {
pt_root = 0
}
}
},
cred = 0xfffff806811adc00,
charge = 6393856,
umtx_data = 0x0
}

Interesting that it is on a shadow list.
--
Andriy Gapon
Mark Johnston
2018-05-29 16:22:17 UTC
Permalink
Post by Andriy Gapon
back trace at:  http://www.freebsd.org/~julian/bob-crash.png
If anyone wants to take a look..
In the exit syscall, while deallocating a vm object.
I haven't seen references to a similar crash in the last 10 days or so.. But if
it rings any bells...
panic: Bad link elm 0xfffff80cc3938360 prev->next != elm
Matching disassembled code to C code, it seems that the crash is somewhere in
vm_object_terminate_pages (inlined into vm_object_terminate), probably in one of
	if (p->queue != PQ_NONE) {
		KASSERT(p->queue < PQ_COUNT, ("vm_object_terminate: "
		    "page %p is not queued", p));
		pq1 = vm_page_pagequeue(p);
		if (pq != pq1) {
			if (pq != NULL) {
				vm_pagequeue_cnt_add(pq, dequeued);
				vm_pagequeue_unlock(pq);
			}
			pq = pq1;
			vm_pagequeue_lock(pq);
			dequeued = 0;
		}
		p->queue = PQ_NONE;
		TAILQ_REMOVE(&pq->pq_pl, p, plinks.q);
		dequeued--;
	}
	if (vm_page_free_prep(p, true))
		continue;
	TAILQ_REMOVE(&object->memq, p, listq);
}
Please note that this is the code before r332974 ("Improve VM page queue scalability").
I am not sure whether r332974 + r333256 would fix the problem or whether it would just get
moved to a different place.
Does this ring a bell to anyone who tinkered with that part of the VM code recently?
This doesn't look familiar to me and I doubt that r332974 fixed the
underlying problem, whatever it is.
Post by Andriy Gapon
Looking a little bit further, I think that object->memq somehow got corrupted.
memq contains just two elements and the reported element is not there.
Based on the debugging session, it would be interesting to know if there
were any other threads somehow manipulating the (dead) object at the
time of the panic.

Among the panics that you observed, is it the same application that is
causing the crash in each case?
Andriy Gapon
2018-05-29 16:38:19 UTC
Permalink
Post by Mark Johnston
Post by Andriy Gapon
back trace at:  http://www.freebsd.org/~julian/bob-crash.png
If anyone wants to take a look..
In the exit syscall, while deallocating a vm object.
I haven't seen references to a similar crash in the last 10 days or so.. But if
it rings any bells...
panic: Bad link elm 0xfffff80cc3938360 prev->next != elm
Matching disassembled code to C code, it seems that the crash is somewhere in
vm_object_terminate_pages (inlined into vm_object_terminate), probably in one of
	if (p->queue != PQ_NONE) {
		KASSERT(p->queue < PQ_COUNT, ("vm_object_terminate: "
		    "page %p is not queued", p));
		pq1 = vm_page_pagequeue(p);
		if (pq != pq1) {
			if (pq != NULL) {
				vm_pagequeue_cnt_add(pq, dequeued);
				vm_pagequeue_unlock(pq);
			}
			pq = pq1;
			vm_pagequeue_lock(pq);
			dequeued = 0;
		}
		p->queue = PQ_NONE;
		TAILQ_REMOVE(&pq->pq_pl, p, plinks.q);
		dequeued--;
	}
	if (vm_page_free_prep(p, true))
		continue;
	TAILQ_REMOVE(&object->memq, p, listq);
}
Please note that this is the code before r332974 ("Improve VM page queue scalability").
I am not sure whether r332974 + r333256 would fix the problem or whether it would just get
moved to a different place.
Does this ring a bell to anyone who tinkered with that part of the VM code recently?
This doesn't look familiar to me and I doubt that r332974 fixed the
underlying problem, whatever it is.
I see.
Post by Mark Johnston
Post by Andriy Gapon
Looking a little bit further, I think that object->memq somehow got corrupted.
memq contains just two elements and the reported element is not there.
Based on the debugging session, it would be interesting to know if there
were any other threads somehow manipulating the (dead) object at the
time of the panic.
I will check for this.
Post by Mark Johnston
Among the panics that you observed, is it the same application that is
causing the crash in each case?
I have two crash dumps right now and in both cases it's sh exec-ing grep.
But I cannot imagine what could be so special about that.
Actually, I see that the shell ran a long pipeline with many grep-s in it, so
there were many exec-s and exits of grep, perhaps some of them concurrent.
--
Andriy Gapon