How to determine the performance hit of regular expressions

This coding example is really about how to figure out what is faster. In some cases regular expressions may be the optimal solution but as a rule of thumb they are suboptimal for the simpler cases. e.g. starts with, ends with, count of occurrences, fixed string search, etc.

I wrote this in response to a post on stackoverflow.com asking for help in crafting a regex to eliminate words from an input file by only printing words that are each made up of only unique letters. i.e. with one or less of each letter. This code is probably more interesting to the Pythonista than to developers of other languages as there are many terser solutions in other languages including a simple egrep.

The input file contains one word per a line.

Contents of example wordlist.txt
abducts
abe
abeam
abel
abele

Example output:
abducts
abe
abel

Code that produces that output written in python using a regex:

python -c 'import re, sys; print "".join(s for s in open(sys.argv[1]) if not re.match(r".*(\w).*\1", s))' wordlist.txt

In python without a regex:

python -c 'import sys; print "".join(s for s in open(sys.argv[1]) if len(s) == len(frozenset(s)))' wordlist.txt

I performed some timing tests with a hardcoded file name and output redirected to /dev/null to avoid including screen output in the timing:

Timings without the regex:

python -m timeit 'import sys' 'print >> sys.stderr, "".join(s for s in open("wordlist.txt") if len(s) == len(frozenset(s)))' 2>/dev/null

10000 loops, best of 3: 91.3 usec per loop

Timings with the regex:

python -m timeit 'import re, sys' 'print >> sys.stderr, "".join(s for s in open("wordlist.txt") if re.match(r".*(\w).*\1", s))' 2>/dev/null

10000 loops, best of 3: 105 usec per loop

The regex is a bit slower than a simple frozenset creation and len comparison in python. The timeit module can be used in this way to test code in modules by importing the module or script and calling the functions you wish to test.

And here is the obligatory long hand versions for those that have a need for a more flexible timing test with the added benefit of not having the compilation time considered part of the timing.

filter_words.py:

#!/usr/bin/env python
"""
Filter out words with more than one of a particular letter.
"""

import re
import sys

def filter_with_regex(input_generator):
  for w in input_generator:
    if not re.match(r".*?(\w).*?\1", w):
      yield w

def filter_without_regex(input_generator):
  for w in input_generator:
    if len(w) == len(frozenset(w)):
      yield w

def filter_words(filter_generator, input_generator):
  return filter_generator(input_generator)

def main(filter_generator=filter_without_regex, file_name=sys.argv[1]):
  ig = open(file_name)
  fg = filter_generator
  return filter_words(fg, ig)

if __name__ == '__main__':
  from timeit import Timer
  t_without = Timer("main(filter_without_regex)", "from __main__ import main, filter_without_regex")
  t_with = Timer("main(filter_with_regex)", "from __main__ import main, filter_with_regex")
  t = (1000000*t_without.timeit(100000)/100000, 1000000*t_with.timeit(100000)/100000)
  print "{0:0.2f} usec/pass without regex\n{1:0.2f} usec/pass with regex".format(*t)

Run the timing tests as ‘python filter_words.py wordlist.txt’ or at the ipython prompt as ‘%run filter_words.py wordlist.txt’.

Advertisements

About freegnu

freegnu and other stuff too
This entry was posted in programming, python. Bookmark the permalink.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s