Discussion:
GPTZFSBOOT in Current r326622 has problems
Warner Losh
2017-12-06 15:54:42 UTC
I updated my amd64 computer today to r326622 and copied the
/boot/gptzfsboot file to the p1 partition on each of my ZFS hard drives. The
BTX loader stopped and could not load. This rendered my system
'un-bootable'. I copied this file from an earlier live filesystem CD,
which restored my computer and enabled me to boot.
Any chance you can bisect when this happened? I think I'll need more
details to see what happened. What world was your old loader based on?
I also see the problem with the loader logs dumping core that others
have reported recently.
I've not seen these reports, nor do I see it in my loader testing. More
details please? Last I had heard, everything was working...

Warner
Thomas Laus
2017-12-06 16:48:01 UTC
Post by Warner Losh
Any chance you can bisect when this happened? I think I'll need more
details to see what happened. What world was your old loader based on?
My last good gptzfsboot was r326070. I had not built anything since
then until this morning when I built world and kernel at r326622.
I'll look at the svn history and see where I should start.
Post by Warner Losh
I also see the problem with the loader logs dumping core that others
have reported recently.
I've not seen these reports, nor do I see it in my loader testing. More
details please? Last I had heard, everything was working...
That was a typo. I had loader on my mind and the core dumps are
dealing with logger. I have seen recent messages to this group about
logger dumping core and someone is bisecting the issue. I was just
reporting that it was still occurring at r326622.

Tom
--
Public Keys:
PGP KeyID = 0x5F22FDC1
GnuPG KeyID = 0x620836CF
Warner Losh
2017-12-06 17:28:20 UTC
Post by Thomas Laus
Post by Warner Losh
Any chance you can bisect when this happened? I think I'll need more
details to see what happened. What world was your old loader based on?
My last good gptzfsboot was r326070. I had not built anything since
then until this morning when I built world and kernel at r326622.
I'll look at the svn history and see where I should start.
I've been *VERY* busy between then and now cleaning up the boot loader
"accumulated technical debt". Alas, sounds like I've broken something. So I
think it's a binary search: I'd start with 326370 as the pivot and 326500 /
326250 as the next steps if it succeeds / fails.

So you are seeing a BTX error? Before we even get to running /boot/loader,
correct? Any chance you can get me that error?
Post by Thomas Laus
Post by Warner Losh
I also see the problem with the loader logs dumping core that others
have reported recently.
I've not seen these reports, nor do I see it in my loader testing. More
details please? Last I had heard, everything was working...
That was a typo. I had loader on my mind and the core dumps are
dealing with logger. I have seen recent messages to this group about
logger dumping core and someone is bisecting the issue. I was just
reporting that it was still occurring at r326622.
Good! One less problem for me to track...

Warner
Thomas Laus
2017-12-06 17:48:29 UTC
Post by Warner Losh
I've been *VERY* busy between then and now cleaning up the boot loader
"accumulated technical debt". Alas, sounds like I've broken something. So I
think it's a binary search: I'd start with 326370 as the pivot and 326500 /
326250 as the next steps if it succeeds / fails.
So you are seeing a BTX error? Before we even get to running /boot/loader,
correct? Any chance you can get me that error?
I get a screen full of register and hex numbers followed by BTX
halted. I can do my best to transcribe all of the numbers, but that
may be very error prone to do by hand. At this point in the boot
process, it is a little hard to fire up a serial console and capture
the output to a file. I run Geli encrypted disks and the BTX halt
comes even before the prompt for a password.

Tom
--
Public Keys:
PGP KeyID = 0x5F22FDC1
GnuPG KeyID = 0x620836CF
Thomas Laus
2017-12-06 22:21:55 UTC
Post by Warner Losh
I've been *VERY* busy between then and now cleaning up the boot loader
"accumulated technical debt". Alas, sounds like I've broken something. So I
think it's a binary search: I'd start with 326370 as the pivot and 326500 /
326250 as the next steps if it succeeds / fails.
Warner:

I reverted my system to r326370 and it booted normally. It looks like
something broke after that revision. I'll queue up another buildworld
after updating the system to r326500.

I received a couple of suggestions to take a picture and post my
console screen of the BTX fault. My camera battery has died and is on
the charger for a while. I'll build r326500 and post a picture if it
shows the same BTX issue.

Tom
--
Public Keys:
PGP KeyID = 0x5F22FDC1
GnuPG KeyID = 0x620836CF
Thomas Laus
2017-12-07 00:17:20 UTC
You can just build the boot blocks at each step if you'd like to save some
time on the binary search.
cd stand
make cleandir obj depend
make -j XX
sudo -E make install
Warner:

I built and loaded r326500 successfully. It looks like the problem is
after r326500 and before r326622.

Tom
--
Public Keys:
PGP KeyID = 0x5F22FDC1
GnuPG KeyID = 0x620836CF
Warner Losh
2017-12-07 00:38:21 UTC
Post by Thomas Laus
You can just build the boot blocks at each step if you'd like to save
some time on the binary search.
cd stand
make cleandir obj depend
make -j XX
sudo -E make install
I built and loaded r326500 successfully. It looks like the problem is
after r326500 and before r326622.
OK. Still a fair number of changes, including changes to geli to fix
warnings...

326585-326594 is a flurry of changes. Then another in the 326609-326610
range. There's one other trivial one. I'd wager that if '500 works, the
breakage will be somewhere in the first range, which suggests 326590 might
be a good, next pivot. There's also a few just after '500 that might break
things as well if I messed something up. '504 and '507 both touch this
stuff directly...

Good luck!

Warner
Thomas Laus
2017-12-07 01:22:41 UTC
Post by Warner Losh
You can just build the boot blocks at each step if you'd like to save
some time on the binary search.
cd stand
make cleandir obj depend
make -j XX
sudo -E make install
OK. Still a fair number of changes, including changes to geli to fix
warnings...
326585-326594 is a flurry of changes. Then another in the 326609-326610
range. There's one other trivial one. I'd wager that if '500 works, the
breakage will be somewhere in the first range, which suggests 326590 might
be a good, next pivot. There's also a few just after '500 that might break
things as well if I messed something up. '504 and '507 both touch this
stuff directly...
Warner:

Building just 'stand' failed in /usr/src/stand/geli: the geli hmac file
reported conflicting types for 'ngets'.

I am doing a full buildworld for r326590. It should be complete
before 9:30 PM EST. I'll post the results and look for replies tomorrow
morning and proceed with the bisection.

Tom
--
Public Keys:
PGP KeyID = 0x5F22FDC1
GnuPG KeyID = 0x620836CF
Thomas Laus
2017-12-07 02:28:11 UTC
Post by Thomas Laus
Post by Warner Losh
You can just build the boot blocks at each step if you'd like to save
some time on the binary search.
cd stand
make cleandir obj depend
make -j XX
sudo -E make install
OK. Still a fair number of changes, including changes to geli to fix
warnings...
326585-326594 is a flurry of changes. Then another in the 326609-326610
range. There's one other trivial one. I'd wager that if '500 works, the
breakage will be somewhere in the first range, which suggests 326590 might
be a good, next pivot. There's also a few just after '500 that might break
things as well if I messed something up. '504 and '507 both touch this
stuff directly...
Building just 'stand' had an error in /usr/src/stand/geli concerning
geli hmac conflicting type for 'ngets'.
I am doing a full buildworld for r326590. It should be complete
before 9:30 PM EST. I'll post the results and look for replies tomorrow
morning and proceed with the bisection.
Warner:

The r326590 buildworld failed due to a problem in the 'geli' code.
The specific error:

/usr/src/sys/geom/eli/g_eli_hmac.c line 46. It complained about a
missing type specifier.

I'll try this again tomorrow morning from an xterm so I can capture
and post the exact message.

Tom
--
Public Keys:
PGP KeyID = 0x5F22FDC1
GnuPG KeyID = 0x620836CF
Thomas Laus
2017-12-07 12:44:51 UTC
Post by Warner Losh
OK. Still a fair number of changes, including changes to geli to fix
warnings...
326585-326594 is a flurry of changes. Then another in the 326609-326610
range. There's one other trivial one. I'd wager that if '500 works, the
breakage will be somewhere in the first range, which suggests 326590
might be a good, next pivot. There's also a few just after '500 that
might break things as well if I messed something up. '504 and '507 both
touch this stuff directly...
Warner:

I reverted my system back to r326585 and 'stand' still won't compile; I
get this output:

--- g_eli_hmac.o ---
In file included from /usr/src/sys/geom/eli/g_eli_hmac.c:46:
In file included from /usr/src/sys/geom/eli/g_eli.h:49:
/usr/include/stdio.h:267:12: error: type specifier missing, defaults to
'int' [-Werror,-Wimplicit-int]
char *gets(char *);
^
/usr/include/stdio.h:267:7: error: expected parameter declarator
char *gets(char *);
^
/usr/src/stand/libsa/stand.h:271:28: note: expanded from macro 'gets'
#define gets(x) ngets((x), 0)
^
In file included from /usr/src/sys/geom/eli/g_eli_hmac.c:46:
In file included from /usr/src/sys/geom/eli/g_eli.h:49:
/usr/include/stdio.h:267:7: error: expected ')'
/usr/src/stand/libsa/stand.h:271:28: note: expanded from macro 'gets'
#define gets(x) ngets((x), 0)
^
/usr/include/stdio.h:267:7: note: to match this '('
/usr/src/stand/libsa/stand.h:271:22: note: expanded from macro 'gets'
#define gets(x) ngets((x), 0)
^
In file included from /usr/src/sys/geom/eli/g_eli_hmac.c:46:
In file included from /usr/src/sys/geom/eli/g_eli.h:49:
/usr/include/stdio.h:267:7: error: conflicting types for 'ngets'
char *gets(char *);
^
/usr/src/stand/libsa/stand.h:271:17: note: expanded from macro 'gets'
#define gets(x) ngets((x), 0)
^
/usr/src/stand/libsa/stand.h:270:13: note: previous declaration is here
extern void ngets(char *, int);
^
In file included from /usr/src/sys/geom/eli/g_eli_hmac.c:46:
In file included from /usr/src/sys/geom/eli/g_eli.h:49:
/usr/include/stdio.h:271:6: error: conflicting types for 'putchar'
int putchar(int);
^
/usr/src/stand/libsa/stand.h:382:14: note: previous declaration is here
extern void putchar(int);
^
In file included from /usr/src/sys/geom/eli/g_eli_hmac.c:46:
In file included from /usr/src/sys/geom/eli/g_eli.h:49:
/usr/include/stdio.h:286:6: error: conflicting types for 'vprintf'
int vprintf(const char * __restrict, __va_list);
^
/usr/src/stand/libsa/stand.h:262:13: note: previous declaration is here
extern void vprintf(const char *fmt, __va_list);
^
In file included from /usr/src/sys/geom/eli/g_eli_hmac.c:46:
/usr/src/stand/libsa/stand.h:265:13: note: previous declaration is here
extern void vsprintf(char *buf, const char *cfmt, __va_list);
^
7 errors generated.
*** [g_eli_hmac.o] Error code 1

make[1]: stopped in /usr/src/stand/geli
1 error

make[1]: stopped in /usr/src/stand/geli
*** [all_subdir_geli] Error code 2

make: stopped in /usr/src/stand
1 error

make: stopped in /usr/src/stand


Tom
--
Public Keys:
PGP KeyID = 0x5F22FDC1
GnuPG KeyID = 0x620836CF
Warner Losh
2017-12-08 20:08:57 UTC
Looks like -DEFI_ZFS_BOOT was dropped from boot1.c in r326589. I've fixed
it in r326714.

Warner
Post by Thomas Laus
Post by Warner Losh
OK. Still a fair number of changes, including changes to geli to fix
warnings...
326585-326594 is a flurry of changes. Then another in the 326609-326610
range. There's one other trivial one. I'd wager that if '500 works, the
breakage will be somewhere in the first range, which suggests 326590
might be a good, next pivot. There's also a few just after '500 that
might break things as well if I messed something up. '504 and '507 both
touch this stuff directly...
I reverted my system back to r326585 and 'stand' still won't compile; I
get this output:
--- g_eli_hmac.o ---
/usr/include/stdio.h:267:12: error: type specifier missing, defaults to
'int' [-Werror,-Wimplicit-int]
char *gets(char *);
^
/usr/include/stdio.h:267:7: error: expected parameter declarator
char *gets(char *);
^
/usr/src/stand/libsa/stand.h:271:28: note: expanded from macro 'gets'
#define gets(x) ngets((x), 0)
^
/usr/include/stdio.h:267:7: error: expected ')'
/usr/src/stand/libsa/stand.h:271:28: note: expanded from macro 'gets'
#define gets(x) ngets((x), 0)
^
/usr/include/stdio.h:267:7: note: to match this '('
/usr/src/stand/libsa/stand.h:271:22: note: expanded from macro 'gets'
#define gets(x) ngets((x), 0)
^
/usr/include/stdio.h:267:7: error: conflicting types for 'ngets'
char *gets(char *);
^
/usr/src/stand/libsa/stand.h:271:17: note: expanded from macro 'gets'
#define gets(x) ngets((x), 0)
^
/usr/src/stand/libsa/stand.h:270:13: note: previous declaration is here
extern void ngets(char *, int);
^
/usr/include/stdio.h:271:6: error: conflicting types for 'putchar'
int putchar(int);
^
/usr/src/stand/libsa/stand.h:382:14: note: previous declaration is here
extern void putchar(int);
^
/usr/include/stdio.h:286:6: error: conflicting types for 'vprintf'
int vprintf(const char * __restrict, __va_list);
^
/usr/src/stand/libsa/stand.h:262:13: note: previous declaration is here
extern void vprintf(const char *fmt, __va_list);
^
/usr/src/stand/libsa/stand.h:265:13: note: previous declaration is here
extern void vsprintf(char *buf, const char *cfmt, __va_list);
^
7 errors generated.
*** [g_eli_hmac.o] Error code 1
make[1]: stopped in /usr/src/stand/geli
1 error
make[1]: stopped in /usr/src/stand/geli
*** [all_subdir_geli] Error code 2
make: stopped in /usr/src/stand
1 error
make: stopped in /usr/src/stand
Tom
--
PGP KeyID = 0x5F22FDC1
GnuPG KeyID = 0x620836CF
Thomas Laus
2017-12-09 02:31:53 UTC
Post by Warner Losh
Looks like -DEFI_ZFS_BOOT was dropped from boot1.c in r326589. I've fixed
it in r326714.
Warner:

I just completed a buildworld on r326720 and it failed to boot again
with a hex dump and BTX failure. I can take a photograph and post it
somewhere of the hex dump if it would provide additional information.
I can also send you an email with the photo attached.

Tom
--
Public Keys:
PGP KeyID = 0x5F22FDC1
GnuPG KeyID = 0x620836CF
Warner Losh
2017-12-09 03:06:11 UTC
Post by Thomas Laus
Post by Warner Losh
Looks like -DEFI_ZFS_BOOT was dropped from boot1.c in r326589. I've fixed
it in r326714.
I just completed a buildworld on r326720 and it failed to boot again
with a hex dump and BTX failure. I can take a photograph and post it
somewhere of the hex dump if it would provide additional information.
I can also send you an email with the photo attached.
Clean build?

Warner
Thomas Laus
2017-12-09 14:12:50 UTC
Post by Warner Losh
Clean build?
It was a clean build. I performed a rm -rf /usr/obj/* before starting
the buildworld after svn updating to r326720.

A couple of notes on my hardware:

I am not using UEFI and my CPU is an early Intel i5 Skylake that has the
Silicon Debug flag turned on as a default. Most operating systems
including FreeBSD turn off this flag during the boot process because of
security concerns. Since BTX is very early in the boot process, the
computer is powered up with SDBG turned on.

Tom
--
Public Keys:
PGP KeyID = 0x5F22FDC1
GnuPG KeyID = 0x620836CF
Thomas Laus
2017-12-09 15:24:49 UTC
Post by Warner Losh
Clean build?
Here is the contents of my loader.conf:

loader_logo="beastie" #Desired logo:orbbw, orb, fbsdbw, beastiebw, beastie, none
aesni_load="YES"
geom_eli_load="YES"
kern.geom.label.disk_ident.enable="0"
kern.geom.label.gptid.enable="0"
zfs_load="YES"

Tom
--
Public Keys:
PGP KeyID = 0x5F22FDC1
GnuPG KeyID = 0x620836CF
Warner Losh
2017-12-09 17:56:20 UTC
Post by Thomas Laus
Post by Warner Losh
Clean build?
It was a clean build. I performed a rm -rf /usr/obj/* before starting
the buildworld after svn updating to r326720.
I am not using UEFI and my CPU is an early Intel i5 Skylake that has the
Silicon Debug flag turned on as a default. Most operating systems
including FreeBSD turn off this flag during the boot process because of
security concerns. Since BTX is very early in the boot process, the
computer is powered up with SDBG turned on.
OK. I don't recall seeing a screen shot of the entire boot. Can you send
that too (privately if you like) so I know exactly what's failing? Is it
gptzfsboot loading /boot/loader? Is it early in /boot/loader or ????

Warner
Toomas Soome
2017-12-09 18:16:57 UTC
Post by Warner Losh
Post by Thomas Laus
Post by Warner Losh
Clean build?
It was a clean build. I performed a rm -rf /usr/obj/* before starting
the buildworld after svn updating to r326720.
I am not using UEFI and my CPU is an early Intel i5 Skylake that has the
Silicon Debug flag turned on as a default. Most operating systems
including FreeBSD turn off this flag during the boot process because of
security concerns. Since BTX is very early in the boot process, the
computer is powered up with SDBG turned on.
OK. I don't recall seeing a screen shot of the entire boot. Can you send
that too (privately if you like) so I know exactly what's failing? Is it
gptzfsboot loading /boot/loader? Is it early in /boot/loader or ????
With BIOS boot you can try pressing a key (space or anything else except Enter); if boot1 is good, you will get the boot: prompt.

rgds,
toomas
Thomas Laus
2017-12-10 00:07:08 UTC
Post by Toomas Soome
With BIOS boot you can try pressing a key (space or anything else except Enter); if boot1 is good, you will get the boot: prompt.
BTX fails a long time before boot1 is read. I don't even get a prompt
for my Geli password.

Tom
--
Public Keys:
PGP KeyID = 0x5F22FDC1
GnuPG KeyID = 0x620836CF
Shawn Webb
2017-12-12 20:38:59 UTC
I updated my amd64 computer today to r326622 and copied the
/boot/gptzfsboot file to the p1 partition on each of my ZFS hard drives. The
BTX loader stopped and could not load. This rendered my system
'un-bootable'. I copied this file from an earlier live filesystem CD,
which restored my computer and enabled me to boot.
I also see the problem with the loader logs dumping core that others
have reported recently.
I'm seeing the same issue with recent HardenedBSD 12-CURRENT/amd64
memstick images. Booting in UEFI mode works, however.

Thanks,
--
Shawn Webb
Cofounder and Security Engineer
HardenedBSD

GPG Key ID: 0x6A84658F52456EEE
GPG Key Fingerprint: 2ABA B6BD EF6A F486 BE89 3D9E 6A84 658F 5245 6EEE
Warner Losh
2017-12-15 23:37:10 UTC
Post by Warner Losh
I updated my amd64 computer today to r326622 and copied the
/boot/gptzfsboot file to the p1 partition on each of my ZFS hard drives. The
BTX loader stopped and could not load. This rendered my system
'un-bootable'. I copied this file from an earlier live filesystem CD,
which restored my computer and enabled me to boot.
Any chance you can bisect when this happened? I think I'll need more
details to see what happened. What world was your old loader based on?
I believe that these issues have been corrected in r326888. My refactoring
to make it easier to bring in the lua boot loader in r326593 (after
breaking the build in r326584 accidentally) uncovered some latent subtle
ordering issues. This caused ZFS boot loaders built with GELI support to fail,
even when GELI was not in use. This was related to an odd interaction between zfs and
geli implementation files in gptzfsboot (and zfsboot) which caused us to
have two different implementations of malloc, with all the fun you'd expect
when the second one got called.

If you have issues after r326888, please let me know.

Warner
Thomas Laus
2017-12-16 17:05:45 UTC
Post by Warner Losh
I believe that these issues have been corrected in r326888. My
refactoring to make it easier to bring in the lua boot loader in r326593
(after breaking the build in r326584 accidentally) uncovered some latent
subtle ordering issues. This caused ZFS boot loaders built with GELI support
to fail, even when GELI was not in use. This was related to an odd interaction between zfs
and geli implementation files in gptzfsboot (and zfsboot) which caused
us to have two different implementations of malloc, with all the fun
you'd expect when the second one got called.
If you have issues after r326888, please let me know.
Warner & Group

I updated my system this morning to r326897 and can confirm that this
problem has been solved.

Good work Warner!

Tom
--
Public Keys:
PGP KeyID = 0x5F22FDC1
GnuPG KeyID = 0x620836CF