From 18e2b1fcd0ca11f14e352f364a874e849910f4e4 Mon Sep 17 00:00:00 2001 From: Michael Witten Date: Sun, 19 Sep 2021 19:58:58 +0000 Subject: [PATCH 01/12] typo/grammar --- sites/en/pages/compare-the-speed-of-grep-with-python-regex.txt | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/sites/en/pages/compare-the-speed-of-grep-with-python-regex.txt b/sites/en/pages/compare-the-speed-of-grep-with-python-regex.txt index 9c62679e..782a88ad 100644 --- a/sites/en/pages/compare-the-speed-of-grep-with-python-regex.txt +++ b/sites/en/pages/compare-the-speed-of-grep-with-python-regex.txt @@ -9,7 +9,7 @@ =abstract start -At one of my client we had a Bash script that grepped a huge log file 20 times in order to generate a report. +One of my clients had a Bash script that grepped a huge log file 20 times in order to generate a report. It created a lot of load on the server as grep was reading the entire file 20 times. As we were converting our Shell scripts to Python anyway I thought I could rewrite it in Python and go over the file From 5b3a0bb165fc2a37c15f5e1b3ea3083f9ebb8761 Mon Sep 17 00:00:00 2001 From: Michael Witten Date: Sun, 19 Sep 2021 20:04:40 +0000 Subject: [PATCH 02/12] Move the command lines that create and verify the test input While reading the article, this is how I wish it had been presented. --- ...re-the-speed-of-grep-with-python-regex.txt | 31 +++++++++---------- 1 file changed, 15 insertions(+), 16 deletions(-) diff --git a/sites/en/pages/compare-the-speed-of-grep-with-python-regex.txt b/sites/en/pages/compare-the-speed-of-grep-with-python-regex.txt index 782a88ad..c2ca257f 100644 --- a/sites/en/pages/compare-the-speed-of-grep-with-python-regex.txt +++ b/sites/en/pages/compare-the-speed-of-grep-with-python-regex.txt @@ -31,11 +31,24 @@ We can run it like this, indicating the name of the file we would like to create the number of rows and the length of rows. 
-python create-big-file.py FILENAME NUMBER-OF-ROWS LENGTH-OF-ROWS +$ python create-big-file.py FILENAME NUMBER-OF-ROWS LENGTH-OF-ROWS + + +For example: + + +$ python create-big-file.py a.txt 100000 50 It will create a file full of the character "x", with a single "y" somewhere. + +$ wc a.txt + 1000000 1000000 51000000 a.txt +$ grep y a.txt +xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxyx + + I think this is going to be good enough for our simple example.

Using grep

@@ -61,21 +74,7 @@ and thous would be probably faster, but in our cases we really had more complex

Comparing the speed

- -python create-big-file.py a.txt 100000 50 - - -Verify the file: - - -$ wc a.txt - 1000000 1000000 51000000 a.txt - - - -# grep y a.txt -xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxyx - +Here are the results of running the grep test: $ time bash examples/grep_speed.sh a.txt 20 From 28ddbcf962a0f1aab37750cb48665dd617bffcfa Mon Sep 17 00:00:00 2001 From: Michael Witten Date: Sun, 19 Sep 2021 20:07:01 +0000 Subject: [PATCH 03/12] typo/grammar --- sites/en/pages/compare-the-speed-of-grep-with-python-regex.txt | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/sites/en/pages/compare-the-speed-of-grep-with-python-regex.txt b/sites/en/pages/compare-the-speed-of-grep-with-python-regex.txt index c2ca257f..9ed972f0 100644 --- a/sites/en/pages/compare-the-speed-of-grep-with-python-regex.txt +++ b/sites/en/pages/compare-the-speed-of-grep-with-python-regex.txt @@ -54,7 +54,7 @@ I think this is going to be good enough for our simple example.

Using grep

In the original shell script we had some 20 different calls to grep, -but to make it simpler I made this shell script with that runs the same regex multiple times. +but to make it simpler I made this shell script, which runs the same regex multiple times. From 3ae3b21eea5f1d6a6683bcd751c4f56c113dc615 Mon Sep 17 00:00:00 2001 From: Michael Witten Date: Sun, 19 Sep 2021 20:08:11 +0000 Subject: [PATCH 04/12] typo/grammar --- sites/en/pages/compare-the-speed-of-grep-with-python-regex.txt | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/sites/en/pages/compare-the-speed-of-grep-with-python-regex.txt b/sites/en/pages/compare-the-speed-of-grep-with-python-regex.txt index 9ed972f0..1e4a9aa2 100644 --- a/sites/en/pages/compare-the-speed-of-grep-with-python-regex.txt +++ b/sites/en/pages/compare-the-speed-of-grep-with-python-regex.txt @@ -58,7 +58,7 @@ but to make it simpler I made this shell script, which runs the same regex multi -You can pass the name of the data file and the number of time you'd like to run grep. +You can pass the name of the data file and the number of times you'd like to run grep.

Grep with Python regexes

From ebdfe43b36bbb35b639a785e5d3001fdce9d81d2 Mon Sep 17 00:00:00 2001 From: Michael Witten Date: Sun, 19 Sep 2021 20:09:15 +0000 Subject: [PATCH 05/12] typo/grammar --- sites/en/pages/compare-the-speed-of-grep-with-python-regex.txt | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/sites/en/pages/compare-the-speed-of-grep-with-python-regex.txt b/sites/en/pages/compare-the-speed-of-grep-with-python-regex.txt index 1e4a9aa2..787ac5f7 100644 --- a/sites/en/pages/compare-the-speed-of-grep-with-python-regex.txt +++ b/sites/en/pages/compare-the-speed-of-grep-with-python-regex.txt @@ -94,7 +94,7 @@ sys 0m0.032s
-grep is about 50 times faster than Python even though grep had to read the file 20 time while Python only read it once. +grep is about 50 times faster than Python even though grep had to read the file 20 times while Python only read it once.

More complex grep

From 514dcc77ffa4cf50477b25722de27670cbee3f7a Mon Sep 17 00:00:00 2001 From: Michael Witten Date: Sun, 19 Sep 2021 20:09:48 +0000 Subject: [PATCH 06/12] better punctuation: s/,/;/ --- sites/en/pages/compare-the-speed-of-grep-with-python-regex.txt | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/sites/en/pages/compare-the-speed-of-grep-with-python-regex.txt b/sites/en/pages/compare-the-speed-of-grep-with-python-regex.txt index 787ac5f7..be801749 100644 --- a/sites/en/pages/compare-the-speed-of-grep-with-python-regex.txt +++ b/sites/en/pages/compare-the-speed-of-grep-with-python-regex.txt @@ -99,7 +99,7 @@ sys 0m0.032s

More complex grep

-In the previous case we used a very simple regex, now let's change it to use a slightly more complex expression +In the previous case we used a very simple regex; now, let's change it to use a slightly more complex expression in which we are not only looking for a single character, but we also want to make sure it is between two identical characters. From 6400c7096ebdf0bbe77f028c32d174327c6585b9 Mon Sep 17 00:00:00 2001 From: Michael Witten Date: Sun, 19 Sep 2021 20:12:06 +0000 Subject: [PATCH 07/12] grammar --- sites/en/pages/compare-the-speed-of-grep-with-python-regex.txt | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/sites/en/pages/compare-the-speed-of-grep-with-python-regex.txt b/sites/en/pages/compare-the-speed-of-grep-with-python-regex.txt index be801749..7899f58e 100644 --- a/sites/en/pages/compare-the-speed-of-grep-with-python-regex.txt +++ b/sites/en/pages/compare-the-speed-of-grep-with-python-regex.txt @@ -152,7 +152,7 @@ grep (GNU grep) 3.4

Other cases

-The results are consistent with what I saw during my work, but I wonder what would be the results if the file was larger than the available memory in my computer. +The results are consistent with what I saw during my work, but I wonder what the results would be if the file were larger than the available memory in my computer.

Conclusion

From 69e72c73805232acbb0f8c7436c48965c612387a Mon Sep 17 00:00:00 2001 From: Michael Witten Date: Sun, 19 Sep 2021 20:12:46 +0000 Subject: [PATCH 08/12] typo/grammar --- sites/en/pages/compare-the-speed-of-grep-with-python-regex.txt | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/sites/en/pages/compare-the-speed-of-grep-with-python-regex.txt b/sites/en/pages/compare-the-speed-of-grep-with-python-regex.txt index 7899f58e..cef20b41 100644 --- a/sites/en/pages/compare-the-speed-of-grep-with-python-regex.txt +++ b/sites/en/pages/compare-the-speed-of-grep-with-python-regex.txt @@ -160,7 +160,7 @@ The results are consistent with what I saw during my work, but I wonder what the Or I made a mistake somewhere that impacts the results. -Oh and one more thing, I also create a Perl version of the code and +Oh and one more thing, I also created a Perl version of the code and Perl is much faster than Python even though it is also slower than the grep code. From 7fdc142b4345d9c10649f34d44a93af641b6a950 Mon Sep 17 00:00:00 2001 From: Michael Witten Date: Sun, 19 Sep 2021 20:39:24 +0000 Subject: [PATCH 09/12] grep_speed*.py: Match the bash code better at first * For a better initial comparison, the Python code has been altered to match the bash code as closely as possible; a function named 'grep' is used to simulate a call to the program 'grep'. * Next, a couple of new Python code files have been introduced: examples/grep_speed_open_once.py examples/grep_speed_optimized.py In particular: * grep_speed_open_once.py: This simply alters the Python code to ensure that the file is opened only once, using 'seek' to return to the beginning of the file on each iteration. * grep_speed_optimized.py: In addition, this one compiles the regular expression only once, which significantly improves the performance, but the results are still very slow. 
* The article text has been updated to reflect these changes; I also used my timing results to make sure that things are consistent. --- examples/grep_speed.py | 15 ++++--- examples/grep_speed.sh | 12 +++-- examples/grep_speed_open_once.py | 20 +++++++++ examples/grep_speed_optimized.py | 21 +++++++++ ...re-the-speed-of-grep-with-python-regex.txt | 45 +++++++++++++++---- 5 files changed, 96 insertions(+), 17 deletions(-) create mode 100644 examples/grep_speed_open_once.py create mode 100644 examples/grep_speed_optimized.py diff --git a/examples/grep_speed.py b/examples/grep_speed.py index 16b4d7bc..8325d2bc 100644 --- a/examples/grep_speed.py +++ b/examples/grep_speed.py @@ -6,9 +6,14 @@ _, filename, limit = sys.argv -with open(filename) as fh: - for line in fh: - for _ in range(int(limit)): - if re.search(r'y', line): - print(line) +def grep(regex, filename): + with open(filename) as fh: + for line in fh: + if re.search(regex, line): + print(line, end='') +i = int(limit) + +while i: + i-=1 + grep('y', filename) diff --git a/examples/grep_speed.sh b/examples/grep_speed.sh index 9d946dde..2ced1a2a 100644 --- a/examples/grep_speed.sh +++ b/examples/grep_speed.sh @@ -1,7 +1,13 @@ +#!/bin/bash + +if (($# != 2)); then + echo "$0" 'FILENAME LIMIT' >&2 + exit 1 +fi + filename=$1 limit=$2 -for ((i=1;i<=$limit;i++)); -do - grep y $filename +for ((i=limit; i; --i)); do + grep y "$filename" done diff --git a/examples/grep_speed_open_once.py b/examples/grep_speed_open_once.py new file mode 100644 index 00000000..1c089f3d --- /dev/null +++ b/examples/grep_speed_open_once.py @@ -0,0 +1,20 @@ +import sys +import re + +if len(sys.argv) != 3: + exit(f"{sys.argv[0]} FILENAME LIMIT") + +_, filename, limit = sys.argv + +def grep(regex, fh): + for line in fh: + if re.search(regex, line): + print(line, end='') + +i = int(limit) + +with open(filename) as fh: + while i: + i-=1 + grep('y', fh) + fh.seek(0) diff --git a/examples/grep_speed_optimized.py b/examples/grep_speed_optimized.py new 
file mode 100644 index 00000000..041b7924 --- /dev/null +++ b/examples/grep_speed_optimized.py @@ -0,0 +1,21 @@ +import sys +import re + +if len(sys.argv) != 3: + exit(f"{sys.argv[0]} FILENAME LIMIT") + +_, filename, limit = sys.argv + +def grep(regex, fh): + for line in fh: + if regex.search(line): + print(line, end='') + +i = int(limit) + +y = re.compile('y') +with open(filename) as fh: + while i: + i-=1 + grep(y, fh) + fh.seek(0) diff --git a/sites/en/pages/compare-the-speed-of-grep-with-python-regex.txt b/sites/en/pages/compare-the-speed-of-grep-with-python-regex.txt index cef20b41..2d1502b4 100644 --- a/sites/en/pages/compare-the-speed-of-grep-with-python-regex.txt +++ b/sites/en/pages/compare-the-speed-of-grep-with-python-regex.txt @@ -77,24 +77,51 @@ and thous would be probably faster, but in our cases we really had more complex Here are the results of running the grep test: -$ time bash examples/grep_speed.sh a.txt 20 +$ time bash examples/grep_speed.sh a.txt 20 >/dev/null -real 0m0.227s -user 0m0.055s -sys 0m0.172s +real 0m0.355s +user 0m0.238s +sys 0m0.097s -$ time python examples/grep_speed.py a.txt 20 +$ time python examples/grep_speed.py a.txt 20 >/dev/null -real 0m9.509s -user 0m9.477s -sys 0m0.032s +real 0m9.897s +user 0m9.772s +sys 0m0.120s +So, grep is about 28 times faster than Python; what if we optimize the Python code by only opening the file once? -grep is about 50 times faster than Python even though grep had to read the file 20 times while Python only read it once. + + +Making that change did almost nothing to improve the speed. + + +$ time python examples/grep_speed_open_once.py a.txt 20 >/dev/null + +real 0m9.712s +user 0m9.625s +sys 0m0.082s + + +What if we optimize the regular expression by compiling it only once? + + + +That makes a significant improvement! 
+ + +$ time python examples/grep_speed_optimized.py a.txt 20 >/dev/null + +real 0m2.198s +user 0m2.121s +sys 0m0.075s + + +By pre-compiling the regular expression, the Python code is now about 4.5x faster than the unoptimized Python code; however, grep is still about 6 times faster than Python, even though grep must start afresh on each iteration.

More complex grep

From 9020ef8203616a9fa302e2dbfa5812212bba5d74 Mon Sep 17 00:00:00 2001 From: Michael Witten Date: Sun, 19 Sep 2021 21:38:03 +0000 Subject: [PATCH 10/12] Add output from test command --- sites/en/pages/compare-the-speed-of-grep-with-python-regex.txt | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/sites/en/pages/compare-the-speed-of-grep-with-python-regex.txt b/sites/en/pages/compare-the-speed-of-grep-with-python-regex.txt index 2d1502b4..91f05822 100644 --- a/sites/en/pages/compare-the-speed-of-grep-with-python-regex.txt +++ b/sites/en/pages/compare-the-speed-of-grep-with-python-regex.txt @@ -139,7 +139,8 @@ identical characters. You can try it yourself: -grep '\(.\)y\1' a.txt +$ grep '\(.\)y\1' a.txt +xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxyx From 89b487b5421e29f8946f934f4c28860b5add2cb1 Mon Sep 17 00:00:00 2001 From: Michael Witten Date: Sun, 19 Sep 2021 21:50:33 +0000 Subject: [PATCH 11/12] Update the examples of a regular expression that is more complex * The optimized Python code is discussed initially. * The unoptimized Python code is discussed next. * The timing has been updated to reflect what I'm seeing. 
--- examples/grep_speed_oxo.py | 15 +++++++--- examples/grep_speed_oxo.sh | 12 ++++++-- examples/grep_speed_oxo_unoptimized.py | 19 ++++++++++++ ...re-the-speed-of-grep-with-python-regex.txt | 30 +++++++++++++------ 4 files changed, 60 insertions(+), 16 deletions(-) create mode 100644 examples/grep_speed_oxo_unoptimized.py diff --git a/examples/grep_speed_oxo.py b/examples/grep_speed_oxo.py index 5c3644cc..9b4d3d4a 100644 --- a/examples/grep_speed_oxo.py +++ b/examples/grep_speed_oxo.py @@ -6,9 +6,16 @@ _, filename, limit = sys.argv -with open(filename) as fh: +def grep(regex, fh): for line in fh: - for _ in range(int(limit)): - if re.search(r'(.)y\1', line): - print(line) + if regex.search(line): + print(line, end='') + +i = int(limit) +y = re.compile(r'(.)y\1') +with open(filename) as fh: + while i: + i-=1 + grep(y, fh) + fh.seek(0) diff --git a/examples/grep_speed_oxo.sh b/examples/grep_speed_oxo.sh index e92fc2d4..95b0207a 100644 --- a/examples/grep_speed_oxo.sh +++ b/examples/grep_speed_oxo.sh @@ -1,7 +1,13 @@ +#!/bin/bash + +if (($# != 2)); then + echo "$0" 'FILENAME LIMIT' >&2 + exit 1 +fi + filename=$1 limit=$2 -for ((i=1;i<=$limit;i++)); -do - grep '\(.\)y\1' $filename +for ((i=limit; i; --i)); do + grep '\(.\)y\1' "$filename" done diff --git a/examples/grep_speed_oxo_unoptimized.py b/examples/grep_speed_oxo_unoptimized.py new file mode 100644 index 00000000..1d75fe7c --- /dev/null +++ b/examples/grep_speed_oxo_unoptimized.py @@ -0,0 +1,19 @@ +import sys +import re + +if len(sys.argv) != 3: + exit(f"{sys.argv[0]} FILENAME LIMIT") + +_, filename, limit = sys.argv + +def grep(regex, filename): + with open(filename) as fh: + for line in fh: + if re.search(regex, line): + print(line, end='') + +i = int(limit) + +while i: + i-=1 + grep(r'(.)y\1', filename) diff --git a/sites/en/pages/compare-the-speed-of-grep-with-python-regex.txt b/sites/en/pages/compare-the-speed-of-grep-with-python-regex.txt index 91f05822..60be1733 100644 --- 
a/sites/en/pages/compare-the-speed-of-grep-with-python-regex.txt +++ b/sites/en/pages/compare-the-speed-of-grep-with-python-regex.txt @@ -147,23 +147,35 @@ xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxyx

Comparing the speed of the more complex examples

-$ time bash examples/grep_speed_oxo.sh a.txt 20 +$ time bash examples/grep_speed_oxo.sh a.txt 20 >/dev/null -real 0m0.196s -user 0m0.035s -sys 0m0.161s +real 0m0.413s +user 0m0.297s +sys 0m0.097s -$ time python examples/grep_speed_oxo.py a.txt 20 +$ time python examples/grep_speed_oxo.py a.txt 20 >/dev/null -real 0m25.067s -user 0m24.972s -sys 0m0.016s +real 0m12.724s +user 0m12.589s +sys 0m0.128s -The speed of grep did not change, but Python became even slower. This time grep is more than a 100 times faster than Python. +The speed of grep did not change appreciably, but the Python code became much slower; this time, grep is more than 30 times faster than Python, despite using some explicit optimizations in the Python code. How does the unoptimized code fare? + + + +Using grep is about 57 times faster than using the unoptimized Python code. + + +$ time python examples/grep_speed_oxo_unoptimized.py a.txt 20 >/dev/null + +real 0m23.448s +user 0m23.319s +sys 0m0.114s + 

Version information

From 6128a67078de5038cd8e99960688f5f443f7c772 Mon Sep 17 00:00:00 2001 From: Michael Witten Date: Sun, 19 Sep 2021 21:52:40 +0000 Subject: [PATCH 12/12] Update the grep version to what I'm using --- sites/en/pages/compare-the-speed-of-grep-with-python-regex.txt | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/sites/en/pages/compare-the-speed-of-grep-with-python-regex.txt b/sites/en/pages/compare-the-speed-of-grep-with-python-regex.txt index 60be1733..fe210fea 100644 --- a/sites/en/pages/compare-the-speed-of-grep-with-python-regex.txt +++ b/sites/en/pages/compare-the-speed-of-grep-with-python-regex.txt @@ -186,7 +186,7 @@ Python 3.8.2 $ grep -V -grep (GNU grep) 3.4 +grep (GNU grep) 3.3
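
A minimal sketch, not part of the patch series, that isolates the one change PATCH 09 credits with the speedup: passing a pattern string to re.search on every line versus calling search on a pattern object compiled once. The in-memory list `lines` is a hypothetical stand-in for a.txt, and the names `grep_uncompiled` and `grep_compiled` are illustrative only. The re module caches recently used patterns, so any difference seen here is presumably per-call lookup overhead rather than repeated compilation; timings will vary by machine and Python version, and no particular ratio is implied.

```python
import re
import timeit

# Hypothetical in-memory stand-in for a.txt: lines of "x" with a single "y".
lines = ["x" * 50 + "\n"] * 1000
lines[500] = "x" * 48 + "yx\n"

def grep_uncompiled(pattern, lines):
    # Hands the pattern string to re.search for every line,
    # in the style of examples/grep_speed.py.
    return [line for line in lines if re.search(pattern, line)]

def grep_compiled(regex, lines):
    # Uses a pattern object that was compiled once up front,
    # in the style of examples/grep_speed_optimized.py.
    return [line for line in lines if regex.search(line)]

y = re.compile("y")
matches_uncompiled = grep_uncompiled("y", lines)
matches_compiled = grep_compiled(y, lines)

# Time 20 passes of each variant, mirroring the LIMIT of 20 used in the article.
t_uncompiled = timeit.timeit(lambda: grep_uncompiled("y", lines), number=20)
t_compiled = timeit.timeit(lambda: grep_compiled(y, lines), number=20)
print(f"uncompiled: {t_uncompiled:.4f}s  compiled: {t_compiled:.4f}s")
```

Both functions return the same matching lines; only the way the pattern reaches the regex engine differs, which makes it easier to see how much of the gap reported above comes from that step alone.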