grep extremely slow for LC_CTYPE=C?

Stefan Esser

2018-05-03 14:08:24 UTC

Hi all,

while working on a new portmaster version, I found that bsdgrep is much
faster in a UTF-8 locale than in the C locale, much to my surprise.

I have uploaded a small shell-script with test data that can be fetched
from:

https://people.freebsd.org/~se/grep-test.txz

The script uses "grep -v -f patternfile datafile" to select from the data file
the lines that are not matched by any pattern in the pattern file:

#-------------------------------------------------------------------
#!/bin/sh

LANG=en_US.UTF-8
LC_CTYPE=en_US.UTF-8

export LANG LC_CTYPE

time grep -v -f grep-test-pattern grep-test-data

LANG=C
LC_CTYPE=C
#unset LANG LC_CTYPE # is an alternative leading to the same result ...

time grep -v -f grep-test-pattern grep-test-data
#-------------------------------------------------------------------

The first "grep" needs 3.5 seconds to finish on my system, but the second
one (with LC_CTYPE=C or no locale set at all) runs for minutes (I did not
bother to check whether it finishes at all).

Is this a bug in grep?

Maybe there is something odd in the data file (loading the pattern is not
slower with LC_CTYPE=C, it takes 0.8 seconds on my system), but this is a
problem that was observed with "real" data, not a specifically constructed
worst case.

Any ideas what's causing this behavior?

I'm currently setting the UTF-8 locale as in the first invocation above
to make grep run in reasonable time, but I'd expect it to be faster in
the C locale ...
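
A minimal sketch of how such a workaround could be scoped to the single grep call, so the rest of a script keeps its own locale. The helper name filter_unmatched is hypothetical, and the sketch assumes that the en_US.UTF-8 locale is installed. LC_ALL is used here instead of LANG/LC_CTYPE because it takes precedence over both.

```shell
#!/bin/sh
# Hypothetical helper (not part of portmaster): print the lines of the data
# file ($2) that match none of the patterns in the pattern file ($1).
# The LC_ALL assignment applies to this one grep invocation only.
# Assumption: the en_US.UTF-8 locale exists on the system.
filter_unmatched() {
    LC_ALL=en_US.UTF-8 grep -v -f "$1" "$2"
}

# Example call with the file names from the test archive:
#   filter_unmatched grep-test-pattern grep-test-data
```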

Regards, STefan

Kyle Evans

2018-05-03 14:41:25 UTC

Hmm... what does `grep -V` look like, just to confirm?

These are the results on my local system:

***@viper:/tmp/grep# ./grep-test.sh
All/mpfr-3.1.7.tgz
0.10 real 0.10 user 0.00 sys
All/mpfr-3.1.7.tgz
0.09 real 0.08 user 0.00 sys

But I don't immediately recall if I have local modifications in
regex(3)/bsdgrep that might have affected this. =(

Thanks,

Kyle Evans

Stefan Esser

2018-05-03 15:19:34 UTC

Am 03.05.18 um 16:41 schrieb Kyle Evans:

Hi Kyle,

thank you for the fast reply. You were right to request grep -V output,
but see below ... ;-)

Post by Kyle Evans

Post by Stefan Esser
The first "grep" needs 3.5 seconds to finish on my system, but the second
one (with LC_CTYPE=C or no locale set at all) runs for minutes (I did not
bother to check whether it finishes at all).
Is this a bug in grep?
Maybe there is something odd in the data file (loading the pattern is not
slower with LC_CTYPE=C, it takes 0.8 seconds on my system), but this is a
problem that was observed with "real" data, not a specifically constructed
worst case.
Any ideas what's causing this behavior?
I'm currently setting the UTF-8 locale as in the first invocation above
to make grep run in reasonable time, but I'd expect it to be faster in
the C locale ...
Regards, STefan

Hmm... what does `grep -V` look like, just to confirm?

Ah, yes, good point ...

$ which grep
/usr/bin/grep

$ grep -V
grep (GNU grep) 2.5.1-FreeBSD

Copyright 1988, 1992-1999, 2000, 2001 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

So, it seems I have to complain somewhere else about this behavior ...

But I have (for a long time) in my /etc/src.conf:

WITH_BSDGREP= yes
WITH_BSD_GREP_FASTMATCH= yes
WITHOUT_GNU_GREP_COMPAT= yes

And before seeing the grep -V output, I was convinced that I had been using
BSD grep (i.e. that it replaced GNU grep with above options) by default ...

But now I see that I need to invoke bsdgrep under that name. It is very fast,
but does not give the expected (correct?) result, which is the single line
that is not suppressed by the pattern match ...
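
A small sketch of how a script could detect which implementation sits behind "grep" before relying on its behavior. The function name grep_flavor is hypothetical. The "GNU grep" string is taken from the output above. That BSD grep's version banner contains "BSD grep" is an assumption.

```shell
#!/bin/sh
# Hypothetical helper: classify a grep implementation from the first line
# of its "grep -V" output, passed in as $1.
# Assumption: BSD grep identifies itself with the string "BSD grep".
grep_flavor() {
    case "$1" in
        *"GNU grep"*) echo gnu ;;
        *"BSD grep"*) echo bsd ;;
        *)            echo unknown ;;
    esac
}

# Example call:
#   grep_flavor "$(grep -V 2>&1 | head -n 1)"
```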

Post by Kyle Evans
All/mpfr-3.1.7.tgz
0.10 real 0.10 user 0.00 sys
All/mpfr-3.1.7.tgz
0.09 real 0.08 user 0.00 sys
But I don't immediately recall if I have local modifications in
regex(3)/bsdgrep that might have affected this. =(

Yes, that's the correct result and extremely fast!

But on my system (with only "bsdgrep" substituted for "grep") I get

$ sh bsdgrep-test.sh | wc
0.15 real 0.14 user 0.00 sys
0.15 real 0.15 user 0.00 sys
3362 3362 94700

I.e. only about 1/3 of the lines are suppressed by the pattern, while all
but 1 line should be ...
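
A discrepancy like this can be caught by running two implementations on the same input and comparing what they print. The following is only a sketch; the function name same_result is hypothetical, and it assumes both commands accept "-v -f patternfile datafile".

```shell
#!/bin/sh
# Hypothetical helper: succeed if two grep implementations ($1 and $2)
# print the same lines for "grep -v -f patternfile datafile".
# $3 is the pattern file, $4 the data file.
# Note: a command that is missing or fails produces empty output, which is
# then compared like any other result.
same_result() {
    a=$("$1" -v -f "$3" "$4")
    b=$("$2" -v -f "$3" "$4")
    [ "$a" = "$b" ]
}

# Example call:
#   same_result grep bsdgrep grep-test-pattern grep-test-data ||
#       echo "implementations disagree"
```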

Or is one of the build options that I used unsafe?

Best regards, STefan

Stefan Esser

2018-05-03 17:54:56 UTC

Post by Stefan Esser

Post by Kyle Evans
Hmm... what does `grep -V` look like, just to confirm?

Eh, no worries there. Newer GNU grep sucks less, and we're going to
replace it Real Soon Now (TM).

Thank you very much - your reply was really helpful!

I just tested with GNU grep 2.27 (the current port version) and it does not
show the extreme slowness of the old version in FreeBSD, but is still more
than 10 times slower than BSD grep on my test data.

Post by Stefan Esser
WITH_BSDGREP= yes
WITH_BSD_GREP_FASTMATCH= yes
WITHOUT_GNU_GREP_COMPAT= yes
And before seeing the grep -V output, I was convinced that I had been using
BSD grep (i.e. that it replaced GNU grep with above options) by default ...
But now I see that I need to invoke bsdgrep under that name. It is very fast,
but does not give the expected (correct?) result, which is the single line
that is not suppressed by the pattern match ...

This is actually because you've typo'd WITH_BSD_GREP. =) WITH_BSD_GREP
will replace /usr/bin/grep with bsdgrep and put GNU grep at
/usr/bin/gnugrep.

Yes, that was what I had expected, and I had correctly spelled WITH_BSD_PATCH,
but never bothered to check that I got the "grep" I wanted ...

I also recommend using WITHOUT_BSD_GREP_FASTMATCH / not using
WITH_BSD_GREP_FASTMATCH. See below response.
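
Since a misspelled knob in src.conf is just an unused make variable, nothing complains about it. A rough sketch of a check for such typos follows. The function name check_grep_knobs is hypothetical, and the list of accepted spellings contains only the names mentioned in this thread, so it is not exhaustive.

```shell
#!/bin/sh
# Hypothetical helper: report grep-related knobs in a src.conf-style file
# ($1) whose spelling is not one of the names mentioned in this thread.
check_grep_knobs() {
    # Extract variable names that start with WITH and contain GREP.
    sed -n 's/^[[:space:]]*\(WITH[A-Z_]*GREP[A-Z_]*\)[[:space:]]*=.*/\1/p' "$1" |
    while read -r knob; do
        case "$knob" in
            WITH_BSD_GREP|WITH_BSD_GREP_FASTMATCH) ;;
            WITHOUT_BSD_GREP_FASTMATCH|WITHOUT_GNU_GREP_COMPAT) ;;
            *) echo "unrecognized knob: $knob" ;;
        esac
    done
}

# Example call:
#   check_grep_knobs /etc/src.conf
```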

It is so much faster than GNU grep on this use-case anyway ;-)

$ sh grep-test.sh
All/mpfr-3.1.7.tgz
0.14 real 0.13 user 0.00 sys
All/mpfr-3.1.7.tgz
0.13 real 0.13 user 0.00 sys

This is a factor of 30 to 40 better than with our GNU grep (for the UTF-8 case,
where it finishes in finite time, orders of magnitude faster for LANG=C ;-) ).

And yes, FASTMATCH was responsible for the erroneous result in my previous
tests with BSD grep. Now that I have rebuilt it without that option, it works
perfectly for me :)

BSD_GREP_FASTMATCH is best left off (the default on HEAD). It was disabled
because the version of tre ("fastmatch") that bsdgrep uses is buggy
and I don't want to invest the time to fix it. The performance of the
version we use isn't any better than our libc regex(3), so I made the
decision to switch it to that and focus efforts on optimizing our
general regex implementation instead.

A decision I can well understand and sympathize with.

How about removing the BSD_GREP_FASTMATCH option, then?

I have plans to replace our libc regex(3) with Onigmo [1], which is at
least twice as fast as what we have and comes with all kinds of other
extensions. GNU extensions will be exposed via libregex, and I also
plan to install Onigmo on its own so that others can use it through its
own interface. The difference between it and libregex will be that
libregex exposes a regex(3) interface for using extensions, with an
option to go REG_POSIX.
[1] https://github.com/k-takata/Onigmo

Great plan! But for now BSD grep seems well up to the task, and my only
remaining problem is that I need to support stable releases that use (and
will stay with) the old GNU grep, so I'll need to keep the workaround (or
perhaps depend on the port version?).

Thanks again!

Best regards, STefan

Kyle Evans

2018-05-03 18:11:05 UTC

Post by Stefan Esser

Post by Kyle Evans
Hmm... what does `grep -V` look like, just to confirm?

Eh, no worries there. Newer GNU grep sucks less, and we're going to
replace it Real Soon Now (TM).

Thank you very much - your reply was really helpful!
I just tested with GNU grep 2.27 (the current port version) and it does not
show the extreme slowness of the old version in FreeBSD, but is still more
than 10 times slower than BSD grep on my test data.

This is good. =) We tend to be slower in most areas, so any win is a good one.

Post by Stefan Esser

This is actually because you've typo'd WITH_BSD_GREP. =) WITH_BSD_GREP
will replace /usr/bin/grep with bsdgrep and put GNU grep at
/usr/bin/gnugrep.

Yes, that was what I had expected, and I had correctly spelled WITH_BSD_PATCH,
but never bothered to check that I got the "grep" I wanted ...

I also recommend using WITHOUT_BSD_GREP_FASTMATCH / not using
WITH_BSD_GREP_FASTMATCH. See below response.

It is so much faster than GNU grep on this use-case anyway ;-)
$ sh grep-test.sh
All/mpfr-3.1.7.tgz
0.14 real 0.13 user 0.00 sys
All/mpfr-3.1.7.tgz
0.13 real 0.13 user 0.00 sys
This is a factor of 30 to 40 better than with our GNU grep (for the UTF-8 case,
where it finishes in finite time, orders of magnitude faster for LANG=C ;-) ).
And yes, FASTMATCH was responsible for the erroneous result in my previous
tests with BSD grep. Now that I have rebuilt it without that option, it works
perfectly for me :)

Also good to hear!

Post by Stefan Esser

A decision I can well understand and sympathize with.
How about removing the BSD_GREP_FASTMATCH option, then?

Right- I've been meaning to find time to rip it all out. I'll see if I
can harvest some spare time from the weekend to make it happen.

I do recommend pulling in textproc/gnugrep if you can. GNU grep in
base has bugs that are likely going to stay unless someone (that isn't
me =)) wants to take up the task of maintaining an older version of
GNU Grep that's going to be disappearing from head. Newer versions
have a lot more sensible behavior than what we have in base.

Post by Stefan Esser
Thanks again!
Best regards, STefan