szabgab · mfwitten · Sep 19, 2021
diff --git a/examples/grep_speed.py b/examples/grep_speed.py
@@ -6,9 +6,14 @@
 
 _, filename, limit = sys.argv
 
-with open(filename) as fh:
-    for line in fh:
-        for _ in range(int(limit)):
-            if re.search(r'y', line):
-                print(line)
+def grep(regex, filename):
+    with open(filename) as fh:
+        for line in fh:
+            if re.search(regex, line):
+                print(line, end='')
 
+i = int(limit)
+
+while i:
+    i -= 1
+    grep('y', filename)
diff --git a/examples/grep_speed.sh b/examples/grep_speed.sh
@@ -1,7 +1,13 @@
+#!/bin/bash
+
+if (($# != 2)); then
+    echo "$0" 'FILENAME LIMIT' >&2
+    exit 1
+fi
+
 filename=$1
 limit=$2
 
-for ((i=1;i<=$limit;i++));
-do
-    grep y $filename
+for ((i=limit; i; --i)); do
+    grep y "$filename"
 done
diff --git a/examples/grep_speed_open_once.py b/examples/grep_speed_open_once.py
@@ -0,0 +1,20 @@
+import sys
+import re
+
+if len(sys.argv) != 3:
+    sys.exit(f"{sys.argv[0]} FILENAME LIMIT")
+
+_, filename, limit = sys.argv
+
+def grep(regex, fh):
+    for line in fh:
+        if re.search(regex, line):
+            print(line, end='')
+
+i = int(limit)
+
+with open(filename) as fh:
+    while i:
+        i -= 1
+        grep('y', fh)
+        fh.seek(0)
diff --git a/examples/grep_speed_optimized.py b/examples/grep_speed_optimized.py
@@ -0,0 +1,21 @@
+import sys
+import re
+
+if len(sys.argv) != 3:
+    sys.exit(f"{sys.argv[0]} FILENAME LIMIT")
+
+_, filename, limit = sys.argv
+
+def grep(regex, fh):
+    for line in fh:
+        if regex.search(line):
+            print(line, end='')
+
+i = int(limit)
+
+y = re.compile('y')
+with open(filename) as fh:
+    while i:
+        i -= 1
+        grep(y, fh)
+        fh.seek(0)
diff --git a/examples/grep_speed_oxo.py b/examples/grep_speed_oxo.py
@@ -6,9 +6,16 @@
 
 _, filename, limit = sys.argv
 
-with open(filename) as fh:
+def grep(regex, fh):
     for line in fh:
-        for _ in range(int(limit)):
-            if re.search(r'(.)y\1', line):
-                print(line)
+        if regex.search(line):
+            print(line, end='')
+
+i = int(limit)
 
+y = re.compile(r'(.)y\1')
+with open(filename) as fh:
+    while i:
+        i -= 1
+        grep(y, fh)
+        fh.seek(0)
diff --git a/examples/grep_speed_oxo.sh b/examples/grep_speed_oxo.sh
@@ -1,7 +1,13 @@
+#!/bin/bash
+
+if (($# != 2)); then
+    echo "$0" 'FILENAME LIMIT' >&2
+    exit 1
+fi
+
 filename=$1
 limit=$2
 
-for ((i=1;i<=$limit;i++));
-do
-    grep '\(.\)y\1' $filename
+for ((i=limit; i; --i)); do
+    grep '\(.\)y\1' "$filename"
 done
diff --git a/examples/grep_speed_oxo_unoptimized.py b/examples/grep_speed_oxo_unoptimized.py
@@ -0,0 +1,19 @@
+import sys
+import re
+
+if len(sys.argv) != 3:
+    sys.exit(f"{sys.argv[0]} FILENAME LIMIT")
+
+_, filename, limit = sys.argv
+
+def grep(regex, filename):
+    with open(filename) as fh:
+        for line in fh:
+            if re.search(regex, line):
+                print(line, end='')
+
+i = int(limit)
+
+while i:
+    i -= 1
+    grep(r'(.)y\1', filename)
diff --git a/sites/en/pages/compare-the-speed-of-grep-with-python-regex.txt b/sites/en/pages/compare-the-speed-of-grep-with-python-regex.txt
@@ -9,7 +9,7 @@
 
 =abstract start
 
-At one of my client we had a Bash script that grepped a huge log file 20 times in order to generate a report.
+One of my clients had a Bash script that grepped a huge log file 20 times in order to generate a report.
 It created a lot of load on the server as <b>grep</b> was reading the entire file 20 times.
 
 As we were converting our Shell scripts to Python anyway I thought I could rewrite it in Python and go over the file
@@ -31,21 +31,34 @@ We can run it like this, indicating the name of the file we would like to create
 the number of rows and the length of rows.
 
 <code>
-python create-big-file.py FILENAME NUMBER-OF-ROWS LENGTH-OF-ROWS
+$ python create-big-file.py FILENAME NUMBER-OF-ROWS LENGTH-OF-ROWS
+</code>
+
+For example:
+
+<code>
+$ python create-big-file.py a.txt 1000000 50
 </code>
 
 It will create a file full of the character "x", with a single "y" somewhere.
 
+<code>
+$ wc a.txt
+ 1000000  1000000 51000000 a.txt
+$ grep y a.txt
+xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxyx
+</code>
+
 I think this is going to be good enough for our simple example.
 
 <h2>Using grep</h2>
 
 In the original shell script we had some 20 different calls to <b>grep</b>,
-but to make it simpler I made this shell script with that runs the same regex multiple times.
+but to make it simpler I made this shell script, which runs the same regex multiple times.
 
 <include file="examples/grep_speed.sh">
 
-You can pass the name of the data file and the number of time you'd like to run <b>grep</b>.
+You can pass the name of the data file and the number of times you'd like to run <b>grep</b>.
 
 <h2>Grep with Python regexes</h2>
 
@@ -61,46 +74,59 @@ and thous would be probably faster, but in our cases we really had more complex
 
 <h2>Comparing the speed</h2>
 
+Here are the results of running the grep test:
+
 <code>
-python create-big-file.py a.txt 100000 50
+$ time bash examples/grep_speed.sh a.txt 20 >/dev/null
+
+real    0m0.355s
+user    0m0.238s
+sys     0m0.097s
 </code>
 
-Verify the file:
 
 <code>
-$ wc a.txt
- 1000000  1000000 51000000 a.txt
-</code>
+$ time python examples/grep_speed.py a.txt 20 >/dev/null
 
-<code>
-# grep y a.txt
-xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxyx
+real    0m9.897s
+user    0m9.772s
+sys     0m0.120s
 </code>
 
+So, <b>grep</b> is about 28 times faster than Python; what if we optimize the Python code by only opening the file once?
+
+<include file="examples/grep_speed_open_once.py">
+
+Making that change did almost nothing to improve the speed.
+
 <code>
-$ time bash examples/grep_speed.sh a.txt 20
+$ time python examples/grep_speed_open_once.py a.txt 20 >/dev/null
 
-real  0m0.227s
-user  0m0.055s
-sys   0m0.172s
+real    0m9.712s
+user    0m9.625s
+sys     0m0.082s
 </code>
 
+What if we optimize the regular expression by compiling it only once?
+
+<include file="examples/grep_speed_optimized.py">
+
+That makes a significant improvement!
 
 <code>
-$ time python examples/grep_speed.py a.txt 20
+$ time python examples/grep_speed_optimized.py a.txt 20 >/dev/null
 
-real  0m9.509s
-user  0m9.477s
-sys   0m0.032s
+real    0m2.198s
+user    0m2.121s
+sys     0m0.075s
 </code>
 
-
-<b>grep</b> is about 50 times faster than Python even though <b>grep</b> had to read the file 20 time while Python only read it once.
+By pre-compiling the regular expression, the Python code is now about 4.5x faster than the unoptimized Python code; however, <b>grep</b> is <em>still</em> about 6 times faster than Python, even though <b>grep</b> must start afresh on each iteration.
 
 
 <h2>More complex grep</h2>
 
-In the previous case we used a very simple regex, now let's change it to use a slightly more complex expression
+In the previous case we used a very simple regex; now, let's change it to use a slightly more complex expression
 in which we are not only looking for a single character, but we also want to make sure it is between two
 identical characters.
 
@@ -113,30 +139,43 @@ identical characters.
 You can try it yourself:
 
 <code>
-grep '\(.\)y\1' a.txt
+$ grep '\(.\)y\1' a.txt
+xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxyx
 </code>
 
 
 <h2>Comparing the speed of the more complex examples</h2>
 
 <code>
-$ time bash examples/grep_speed_oxo.sh a.txt 20
+$ time bash examples/grep_speed_oxo.sh a.txt 20 >/dev/null
 
-real   0m0.196s
-user   0m0.035s
-sys    0m0.161s
+real    0m0.413s
+user    0m0.297s
+sys     0m0.097s
 </code>
 
 
 <code>
-$ time python examples/grep_speed_oxo.py a.txt 20
+$ time python examples/grep_speed_oxo.py a.txt 20 >/dev/null
 
-real   0m25.067s
-user   0m24.972s
-sys    0m0.016s
+real    0m12.724s
+user    0m12.589s
+sys     0m0.128s
 </code>
 
-The speed of <b>grep</b> did not change, but Python became even slower. This time grep is more than a 100 times faster than Python.
+The speed of <b>grep</b> did not change appreciably, but the Python code became much slower; this time, <b>grep</b> is more than 30 times faster than Python, despite the explicit optimizations in the Python code. How does the unoptimized code fare?
+
+<include file="examples/grep_speed_oxo_unoptimized.py">
+
+Using <b>grep</b> is about 57 times faster than using the unoptimized Python code.
+
+<code>
+$ time python examples/grep_speed_oxo_unoptimized.py a.txt 20 >/dev/null
+
+real    0m23.448s
+user    0m23.319s
+sys     0m0.114s
+</code>
 
 <h2>Version information</h2>
 
@@ -147,21 +186,21 @@ Python 3.8.2
 
 <code>
 $ grep -V
-grep (GNU grep) 3.4
+grep (GNU grep) 3.3
 </code>
 
 
 <h2>Other cases</h2>
 
-The results are consistent with what I saw during my work, but I wonder what would be the results if the file was larger than the available memory in my computer.
+The results are consistent with what I saw during my work, but I wonder what the results would be if the file were larger than the available memory in my computer.
 
 <h2>Conclusion</h2>
 
 <b>grep</b> is so much faster than the regex engine of Python that even reading the whole file several times does not matter.
 
 Or I made a mistake somewhere that impacts the results.
 
-Oh and one more thing, I also create a <a href="https://perlmaven.com/">Perl</a> version of the code and
+Oh and one more thing, I also created a <a href="https://perlmaven.com/">Perl</a> version of the code and
 <a href="https://perlmaven.com/compare-the-speed-of-perl-and-python-regex">Perl is much faster than Python</a>
 even though it is also slower than the <b>grep</b> code.
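
Review note: the speedup figures quoted in the article text can be rechecked directly from the `real` timings that appear in this diff. The following is only a sketch for that arithmetic; the names `timings` and `ratio` are illustrative and are not part of the repository.

```python
# Wall-clock ("real") timings in seconds, copied from the article text in this diff.
timings = {
    "grep_speed.sh": 0.355,
    "grep_speed.py": 9.897,
    "grep_speed_open_once.py": 9.712,
    "grep_speed_optimized.py": 2.198,
    "grep_speed_oxo.sh": 0.413,
    "grep_speed_oxo.py": 12.724,
    "grep_speed_oxo_unoptimized.py": 23.448,
}


def ratio(slow, fast):
    """Return how many times longer `slow` ran than `fast`."""
    return timings[slow] / timings[fast]


# Each pair is (slower script, faster script), matching a claim in the article.
comparisons = [
    ("grep_speed.py", "grep_speed.sh"),
    ("grep_speed.py", "grep_speed_optimized.py"),
    ("grep_speed_optimized.py", "grep_speed.sh"),
    ("grep_speed_oxo.py", "grep_speed_oxo.sh"),
    ("grep_speed_oxo_unoptimized.py", "grep_speed_oxo.sh"),
]

for slow, fast in comparisons:
    print(f"{slow} vs {fast}: {ratio(slow, fast):.1f}x")
```

These are single runs on one machine, so the ratios are best read as rough magnitudes rather than precise measurements.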