This coding example is really about how to figure out what is faster. In some cases regular expressions may be the optimal solution but as a rule of thumb they are suboptimal for the simpler cases. e.g. starts with, ends with, count of occurrences, fixed string search, etc.
I wrote this in response to a post on stackoverflow.com asking for help in crafting a regex to eliminate words from an input file by only printing words that are each made up of only unique letters. i.e. with one or less of each letter. This code is probably more interesting to the Pythonista than to developers of other languages as there are many terser solutions in other languages including a simple egrep.
The input file contains one word per a line.
Contents of example wordlist.txt
abducts
abe
abeam
abel
abele
Example output:
abducts
abe
abel
Code that produces that output written in python using a regex:
python -c 'import re, sys; print "".join(s for s in open(sys.argv[1]) if not re.match(r".*(\w).*\1", s))' wordlist.txt
In python without a regex:
python -c 'import sys; print "".join(s for s in open(sys.argv[1]) if len(s) == len(frozenset(s)))' wordlist.txt
I performed some timing tests with a hardcoded file name and output redirected to /dev/null to avoid including screen output in the timing:
Timings without the regex:
python -m timeit 'import sys' 'print >> sys.stderr, "".join(s for s in open("wordlist.txt") if len(s) == len(frozenset(s)))' 2>/dev/null
10000 loops, best of 3: 91.3 usec per loop
Timings with the regex:
python -m timeit 'import re, sys' 'print >> sys.stderr, "".join(s for s in open("wordlist.txt") if re.match(r".*(\w).*\1", s))' 2>/dev/null
10000 loops, best of 3: 105 usec per loop
The regex is a bit slower than a simple frozenset creation and len comparison in python. The timeit module can be used in this way to test code in modules by importing the module or script and calling the functions you wish to test.
And here is the obligatory long hand versions for those that have a need for a more flexible timing test with the added benefit of not having the compilation time considered part of the timing.
filter_words.py:
#!/usr/bin/env python
"""
Filter out words with more than one of a particular letter.
"""
import re
import sys
def filter_with_regex(input_generator):
for w in input_generator:
if not re.match(r".*?(\w).*?\1", w):
yield w
def filter_without_regex(input_generator):
for w in input_generator:
if len(w) == len(frozenset(w)):
yield w
def filter_words(filter_generator, input_generator):
return filter_generator(input_generator)
def main(filter_generator=filter_without_regex, file_name=sys.argv[1]):
ig = open(file_name)
fg = filter_generator
return filter_words(fg, ig)
if __name__ == '__main__':
from timeit import Timer
t_without = Timer("main(filter_without_regex)", "from __main__ import main, filter_without_regex")
t_with = Timer("main(filter_with_regex)", "from __main__ import main, filter_with_regex")
t = (1000000*t_without.timeit(100000)/100000, 1000000*t_with.timeit(100000)/100000)
print "{0:0.2f} usec/pass without regex\n{1:0.2f} usec/pass with regex".format(*t)
Run the timing tests as ‘python filter_words.py wordlist.txt’ or at the ipython prompt as ‘%run filter_words.py wordlist.txt’.