| Author |
Message |
John R. Graham Administrator


Joined: 08 Mar 2005 Posts: 4849 Location: Somewhere over Atlanta, Georgia
|
Posted: Mon Apr 14, 2008 3:31 pm Post subject: |
|
|
Have you heard this old joke? A man goes to see his doctor, and says, "Doc, it hurts when I move my arm like this." The doctor replies, "Well, don't do that!" Which brings me to a question. Were you really trying to search for the following?
- Literal string "10", followed by
- Any (possibly multibyte) character, followed by
- Literal string "0", followed by
- Any (possibly multibyte) character, followed by
- Literal string "0", followed by
- Any (possibly multibyte) character, followed by
- Literal string "1"
I didn't think you were. marduk didn't think you were. He suggested that you use "grep -F" or "fgrep". I suggested that you escape the metacharacters that you didn't really want to be metacharacters. Now, in light of all of this, I think your most recent question devolves into
| Quote: | | How do I Band-Aid my system so that I can continue to use this tool badly and not suffer the performance penalty? |
My advice to you is, don't do that! It'll eventually bite you in an unexpected way. In fact, it just did.
Now, if a properly formed search pattern is shown to have low performance with grep or a literal search has low performance in fgrep, then I'm still very interested in helping you figure out why.
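A minimal sketch of the difference, using made-up sample lines rather than the original log (the temp file and the variable names are assumptions for illustration). It counts how many lines each form of the pattern matches:

```shell
# Hypothetical sample data: only the first line contains the literal address.
sample=$(mktemp)
printf '%s\n' '10.0.0.1 GET /' '10a0b0c1 GET /' '1000001 GET /' > "$sample"

# Unescaped: each "." matches any single character.
unescaped=$(grep -c '10.0.0.1 ' "$sample")

# Escaped dots: only a literal "." is accepted at those positions.
escaped=$(grep -c '10\.0\.0\.1 ' "$sample")

# Fixed-string search: the pattern has no metacharacters at all.
fixed=$(grep -cF '10.0.0.1 ' "$sample")

echo "unescaped=$unescaped escaped=$escaped fixed=$fixed"
# prints: unescaped=2 escaped=1 fixed=1

rm -f "$sample"
```

On this sample the unescaped pattern also matches `10a0b0c1`, which is exactly the kind of unintended hit the list above describes.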
- John _________________ Yoda: "Intentionally left blank, this space is." |
|
|
 |
kidoln n00b

Joined: 03 Apr 2008 Posts: 15
|
Posted: Tue Apr 15, 2008 12:55 am Post subject: |
|
|
| john_r_graham wrote: | I suggested that you escape the metacharacters that you didn't really want to be metacharacters. [...] My advice to you is, don't do that! |
I completely agree with you. But look at the output:
| Code: | time /bin/grep '10\.0\.0\.1 ' log > tmp
real 3m42.600s
user 3m42.381s
sys 0m0.108s
|
There is no big difference: 3 min vs. 5 min. Apparently there is still some problem with my current grep. Consider the better performance of grep from ubuntu without LC_ALL=C.
Even if I filter the string the bad way, adding LC_ALL=C lets grep finish the job in only 1 second.
The reason that I keep using "grep -w 10.0.0.1 log" is that I want to show the performance improvement of the different solutions. In my real work, the script has been changed to follow your suggestion. Thank you. Here is your suggestion with LC_ALL=C:
| Code: | time grep '10\.0\.0\.1 ' log > tmp
real 0m0.252s
user 0m0.188s
sys 0m0.064s
|
|
|
|
 |
Akkara Administrator


Joined: 28 Mar 2006 Posts: 3715 Location: &akkara
|
Posted: Tue Apr 15, 2008 10:08 am Post subject: |
|
|
| Quote: | | Consider the better performance of grep from ubuntu without LC_ALL=C |
Are all the locale settings identical in Gentoo and Ubuntu? If not, that could explain the speed difference. If they are identical, it might point to a problem with how Gentoo processes your locale. |
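One way to check is to dump the effective settings on each machine and diff the two outputs. A minimal sketch, assuming the standard `locale` utility is installed on both systems (the function name `show_locale` is only an illustration):

```shell
# Print the locale settings as the current shell resolves them,
# in a form that is easy to save to a file and diff between machines.
show_locale() {
    locale                                      # LANG, LC_CTYPE, LC_COLLATE, ..., LC_ALL
    printf 'charmap=%s\n' "$(locale charmap)"   # character encoding in effect
}

show_locale
```

Running `show_locale > gentoo.txt` on one box and `show_locale > ubuntu.txt` on the other, then `diff gentoo.txt ubuntu.txt`, should show whether the two really agree (the file names are placeholders).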
|
|
 |
kidoln n00b

Joined: 03 Apr 2008 Posts: 15
|
Posted: Tue Apr 15, 2008 7:03 pm Post subject: |
|
|
| Akkara wrote: | | Quote: | | Consider the better performance of grep from ubuntu without LC_ALL=C |
Are all the locale settings identical in Gentoo and Ubuntu? If not, that could explain the speed difference. If they are identical, it might point to a problem with how Gentoo processes your locale. |
I believe that they are the same. |
|
|
 |
Zucca Apprentice


Joined: 14 Jun 2007 Posts: 201 Location: Helsinki, Finland
|
Posted: Thu May 15, 2008 1:36 pm Post subject: Thanks |
|
|
I have the same problem. And over a remote connection (ssh) the results are faster.
The problem was that I use UTF and ISO locales in different situations.
I would never have believed that locale could cause this slow grep processing.
Now instead of 2 minutes my log stat script runs through all the processes in 15 seconds. :)
Thanks again! _________________ Threading support for your bash scripts. |
|
|
 |
colo Apprentice


Joined: 21 Mar 2004 Posts: 160 Location: Austria
|
Posted: Fri Aug 29, 2008 7:51 pm Post subject: |
|
|
I've been hit by this once again today, and the thing that REALLY startles me is that my Ubuntu 8.04 machine does not suffer from the speed decrease when using a multibyte locale (en_US.utf8). My Gentoo version of `grep` does, no matter whether it is version 2.5.1 or 2.5.3, or whether it is compiled with PCRE support or not...
A quick survey on IRC suggested the same for other fellow Gentoo users. Anyone more clever than I here who can explain that to me? _________________ Free Software. Free Society. Better Lives. |
|
|
 |
colo Apprentice


Joined: 21 Mar 2004 Posts: 160 Location: Austria
|
Posted: Sat Aug 30, 2008 7:52 am Post subject: |
|
|
By the way, if I tell grep with -P to interpret my regex (just a string literal in my test, actually) as a PCRE (using libpcre for matching in turn, I guess), my locale does not have this abhorrent impact on performance. Is there some defect in glibc's regex(3) functions that I'm not aware of? _________________ Free Software. Free Society. Better Lives. |
|
|
 |
Akkara Administrator


Joined: 28 Mar 2006 Posts: 3715 Location: &akkara
|
Posted: Mon Sep 28, 2009 5:51 am Post subject: Why is egrep so much slower than sed? [Solved] |
|
|
I have a ~11 MB text file, plain 7-bit ASCII. I want to extract all lines that begin with "have" or "want".
Using egrep is really slow: | Code: | $ time egrep '^have|^want' file.txt >/dev/null
real 1m5.023s
user 1m4.886s
sys 0m0.093s
$ time egrep '^(have|want)' file.txt >/dev/null
real 1m5.469s
user 1m5.346s
sys 0m0.070s |
Using sed is fast: | Code: | $ time sed -n -e '/^have/p' -e '/^want/p' file.txt >/dev/null
real 0m0.264s
user 0m0.263s
sys 0m0.000s |
Why is there such a large disparity?
Last edited by Akkara on Mon Sep 28, 2009 11:40 am; edited 1 time in total |
|
|
 |
truc Advocate


Joined: 25 Jul 2005 Posts: 3078
|
Posted: Mon Sep 28, 2009 8:21 am Post subject: |
|
|
I don't know if that's really the reason, since I can't test on such a big file, but you're using extended regular expressions with egrep (grep -E) and BASIC regular expressions with sed. (and you're not even using the same regexp in both cases)
Could you actually time the following and report back:
| Code: | | sed -nr '/^have|^want/p' big_file |
PS: | Quote: | | In addition, two variant programs egrep and fgrep are available. egrep is the same as grep -E. fgrep is the same as grep -F. Direct invocation as either egrep or fgrep is deprecated, but is provided to allow historical applications that rely on them to run unmodified. |
 _________________ The End of the Internet! |
|
|
 |
Akkara Administrator


Joined: 28 Mar 2006 Posts: 3715 Location: &akkara
|
Posted: Mon Sep 28, 2009 9:46 am Post subject: |
|
|
I don't have the original file (it was debugging output that's changing with every build).
So I re-ran all of the benchmarks on the current version of the file, which is now 22.5 MB. Of 1050374 lines in that file, 262144 match. I had also piped them through 'md5sum' to make sure all tests produce the same output. They do.
| Code: | $ time egrep '^have|^want' file.txt >/dev/null
real 2m10.062s
user 2m9.678s
sys 0m0.160s
$ time sed -n -e '/^have/p' -e '/^want/p' file.txt >/dev/null
real 0m0.479s
user 0m0.460s
sys 0m0.013s
$ time sed -nr '/^have|^want/p' file.txt >/dev/null
real 0m0.384s
user 0m0.363s
sys 0m0.010s
$ time grep '^[hw]a[vn][et]' file.txt >/dev/null
real 2m9.849s
user 2m9.575s
sys 0m0.163s |
Your suggestion is even faster than my sed equivalent.
I also tried regular grep with a modified expression in case it was the extended expressions that are causing problems. This isn't equivalent to the others, although in this file the same lines were matched. It was just as slow as the other greps.
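To illustrate why the bracket expression is not equivalent, here is a small sketch on made-up input (the sample lines and variable names are assumptions, not taken from the real file). It uses `grep -E` in place of `egrep`:

```shell
# Hypothetical sample: "wave" and "hant" satisfy the bracket expression
# but do not start with "have" or "want".
sample=$(mktemp)
printf '%s\n' 'have a' 'want b' 'wave c' 'hant d' 'other e' > "$sample"

# Alternation: only lines starting with "have" or "want".
alternation=$(grep -cE '^have|^want' "$sample")

# Bracket expression: any combination of [hw], a, [vn], [et].
bracket=$(grep -c '^[hw]a[vn][et]' "$sample")

echo "alternation=$alternation bracket=$bracket"
# prints: alternation=2 bracket=4

rm -f "$sample"
```

The two patterns only agree on a file that happens to contain none of the extra combinations, which was the case for the file measured above.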
I'm starting to wonder whether my grep is broken. Isn't there some kind of regular-expressions library that these sorts of apps all use? Going to try to re-emerge it and see what happens.
Edit: re-emerge of sys-apps/grep-2.5.4-r1 complete. I'm still getting similar slow times. |
|
|
 |
Bill Cosby Guru


Joined: 22 Jan 2005 Posts: 430 Location: Aachen, Germany
|
Posted: Mon Sep 28, 2009 10:15 am Post subject: |
|
|
Hm, how long does this script take to execute for you:
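| Code: | #!/usr/bin/perl
# Usage: scriptname '/regex/' file
# The first argument is evaluated as a Perl expression against each line ($_),
# so a pattern has to be passed with its delimiters, e.g. '/^have|^want/'.
use strict;
use warnings;
my $op = shift;
my $file = shift;
open(my $fh, '<', $file) or die "Cannot open $file: $!\n";
while (<$fh>) {
chomp;
print "$_\n" if (eval $op);
}
close($fh); |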
Start it like
| Code: | | scriptname 'regex' file |
_________________ The Creature from Jekyll Island. |
|
|
 |
Genone Retired Dev


Joined: 14 Mar 2003 Posts: 8690 Location: beyond the rim
|
Posted: Mon Sep 28, 2009 11:01 am Post subject: |
|
|
| grep is apparently heavily affected by locale settings (e.g. bug 283149), so try running it with LC_ALL=C to see if that changes anything. |
|
|
 |
Akkara Administrator


Joined: 28 Mar 2006 Posts: 3715 Location: &akkara
|
Posted: Mon Sep 28, 2009 11:26 am Post subject: |
|
|
| Genone wrote: | | try running it with LC_ALL=C to see if that changes anything. |
That's the problem! | Code: | $ time LC_ALL=C egrep '^have|^want' file.txt >/dev/null
real 0m0.143s
user 0m0.143s
sys 0m0.000s |
Chalk up another one to locale silliness.
Any idea what a good interim solution is?
This particular use is pure ascii so setting LC_ALL works. What's the recommended way of doing this in a makefile?
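Since the workaround relies on the input being pure ASCII, it may be worth checking that assumption before forcing the C locale. A minimal sketch; the helper name `is_ascii` and the demo files are hypothetical:

```shell
# Succeeds (exit status 0) when the file contains only 7-bit ASCII bytes.
is_ascii() {
    # Delete every byte in the range 0-127; anything left over is non-ASCII.
    [ "$(LC_ALL=C tr -d '\000-\177' < "$1" | wc -c)" -eq 0 ]
}

# Made-up demo files: one plain ASCII, one containing a UTF-8 encoded "é".
ascii_file=$(mktemp)
utf8_file=$(mktemp)
printf 'have 1\nwant 2\n' > "$ascii_file"
printf 'caf\303\251\n' > "$utf8_file"

if is_ascii "$ascii_file"; then
    echo "ascii_file: pure ASCII, LC_ALL=C should be safe"
fi
if is_ascii "$utf8_file"; then
    echo "utf8_file: pure ASCII"
else
    echo "utf8_file: non-ASCII bytes present, keep the UTF-8 locale"
fi

rm -f "$ascii_file" "$utf8_file"
```

In a makefile recipe the same `LC_ALL=C` prefix can be placed directly in front of the command, since each recipe line is handed to the shell and the assignment then applies to that one command only.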
But at the same time, I like 8-bit stuff parsed as utf8. I'm tired of things like song name tags getting munged when I happen to start a music app while in my coding environment and then go edit a tag without being super-careful.
But I don't want the other silliness that comes with locales. In fact, I use LC_COLLATE=C globally. I hate-hate-hate the language-specific sortings, especially the fact that it seems to ignore space and punctuation and puts things like 'ab.x' and 'ab x' ahead of 'abc' - what kind of twisted thinking was going on to propose that, and how did it ever get through committee?
The ideal for me would be some kind of LC setting that effectively says, "parse the raw bytes as utf8 into their entity-integers, then sort/match/etc. against those integers in regular integer order, and display the results (converted back to utf8)." Is there such a thing? |
|
|
 |
Mike Hunt Watchman


Joined: 19 Jul 2009 Posts: 5287
|
Posted: Mon Sep 28, 2009 11:35 am Post subject: |
|
|
you could alias egrep and temporarily disable it when you need to | Code: | | alias egrep='LC_ALL=C egrep' |
and when needed run egrep unaliased: |
|
|
 |
desultory Administrator

Joined: 04 Nov 2005 Posts: 7061
|
Posted: Tue Oct 20, 2009 10:48 am Post subject: |
|
|
| Mike Hunt wrote: | you could alias egrep and temporarily disable it when you need to | Code: | | alias egrep='LC_ALL=C egrep' |
and when needed run egrep unaliased: | Another approach would be to create a more permanent alias, perhaps c_egrep, and use egrep normally, though the underlying problem still remains.
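A minimal sketch of that idea, written as a shell function rather than an alias, because bash does not expand aliases in non-interactive scripts by default. The name `c_egrep` follows the suggestion above; the sample input is made up:

```shell
# Hypothetical wrapper: run grep -E in the C locale.
# The LC_ALL=C prefix applies to this single grep invocation only,
# so the caller's locale settings are left untouched.
c_egrep() {
    LC_ALL=C grep -E "$@"
}

# Demo on made-up input: prints the lines starting with "have" or "want".
printf '%s\n' 'have 1' 'other 2' 'want 3' | c_egrep '^have|^want'
```

Plain `egrep` (or `grep -E`) keeps using the session locale, so UTF-8 input is still handled normally wherever the wrapper is not used.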
Merged the preceding seven posts. |
|
|
 |
|