Python Performance Tuning

Having just built a solver for KenKen puzzles, this is a good opportunity to take a look at performance tuning in Python. Today we’ll look at some basic instrumentation, and begin searching for bottlenecks in the solver.

Preliminaries

For the purposes of this post, I’m assuming that:

* The solver code is stored in a file named neknek_1.py
* The neknek_1.py file is on the import path
* A 9×9 test puzzle is stored in a file named 9×9.txt
* The 9×9.txt file is in the current working directory

Here is the 9×9 test puzzle I’ll be using for performance measurement:

#	9	
*	240	A1 B1 A2
+	12	B2 A3 B3
!	9	A4
-	5	A5 A6
*	72	A7 B7
+	7	A8 B8 C8
/	2	A9 B9
*	56	C3 B4 C4
+	11	B5 B6
/	2	C1 C2
+	10	C5 C6 D6
*	42	C7 D7 D8
+	27	C9 D9 E9 F9 G9
/	2	D1 E1
+	19	D2 D3 D4
+	16	D5 E5 F5
!	3	E2
-	1	E3 E4
+	24	E6 E7 E8
*	252	F1 G1 G2 G3
+	6	F2 F3 F4
*	216	F6 G5 G6
+	15	F7 G7 F8
+	16	G4 H4 I4
*	90	G8 H8 I8
+	23	H1 I1 I2
+	15	H2 H3 I3
-	3	H5 H6
*	12	H7 I7
-	7	H9 I9
-	2	I5 I6

Timeit

The simplest Python tuning tool is probably the timeit module. From PythonWin it would be invoked this way:

>>> import timeit
>>> timeit.Timer('neknek_1.solve(neknek_1.Puzzle("9x9.txt"))', 'import neknek_1').timeit(number=10)/10

This code will load and solve the 9×9 puzzle 10 times, and return the average time (as a float, in seconds) that a single load-and-solve loop took.

Timeit is a crude but effective way to see the impact of optimizations on a macro level. On my machine, at the moment, it returns an estimate of 0.45s per 9×9 puzzle. (Note that this is significantly lower than what I observed yesterday, indicating that some care must be used when evaluating the results of this module.)

Hotshot

The hotshot module a true profiler; it will log individual function entry/exit information during program execution, writing the results to a file. The output file can be parsed by tools packaged with the hotshot module. From PythonWin, the hotshot profiler would be invoked this way:

>>> import hotshot
>>> import neknek_1
>>> prof = hotshot.Profile('hotshot_nn_stats')
>>> prof.run('neknek_1.solve(neknek_1.Puzzle("9x9.txt"))').close()

To view a summary of the profiler’s results, use these commands:

>>> import hotshot.stats
>>> hotshot.stats.load('hotshot_nn_stats').strip_dirs().sort_stats('time').print_stats()

which will output:

         297518 function calls (36546 primitive calls) in 3111.408 CPU seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
133583/11855 1221.719    0.009 2028.554    0.171 neknek_1.py:18()
115588/4363  923.962    0.008 2145.681    0.492 neknek_1.py:15(can_make_sum_p)
       44  207.178    4.709 3086.981   70.159 neknek_1.py:112(constrain)
14470/1329  150.414    0.010  283.972    0.214 neknek_1.py:21(can_make_product_p)
18308/3559  133.558    0.007  247.684    0.070 neknek_1.py:24()
      640  131.804    0.206 2760.608    4.313 neknek_1.py:44(apply)
     1000  118.984    0.119  118.984    0.119 neknek_1.py:78(apply)
     2450   59.254    0.024 2570.519    1.049 neknek_1.py:48()
     2528   58.285    0.023   58.285    0.023 neknek_1.py:45()
     4378   54.718    0.012 2200.400    0.503 neknek_1.py:56(_test_component)
     1392   18.611    0.013  302.583    0.217 neknek_1.py:67(_test_component)
     44/1    7.363    0.167 2112.332 2112.332 neknek_1.py:132(search)
     1213    6.575    0.005    6.575    0.005 neknek_1.py:137()
      910    5.532    0.006    5.532    0.006 neknek_1.py:63(_test_component)
     66/2    2.864    0.043 2111.620 1055.810 neknek_1.py:139()
      288    2.750    0.010    2.750    0.010 neknek_1.py:74(_test_component)
        1    2.416    2.416 3109.565 3109.565 neknek_1.py:100(solve)
        1    1.215    1.215    1.740    1.740 neknek_1.py:91(__init__)

	... clipped ...

Initial Observations

The first thing to notice is the absolutely stupendous amount of time devoted to can_make_sum_p() – nearly 2/3rds of the execution time is devoted to this single utility function. If you recall, it’s source code is:

# Can we select exactly one member from each set s.t. the sum of all selected members is t?
def can_make_sum_p(t, sets):
    if (not sets): return (t == 0)
    head = sets[0]; tail = sets[1:]
    return any(can_make_sum_p(t-e, tail) for e in head if e <= t)

Let's see if we can speed this up a little.

Base Case

The simplest thing to address is the base case. Right now, the base case of the recursive call is reached when sets is empty; at that point, the function returns True iff t equals 0. This is a bit silly; an (almost) equally good base case is reached when sets contains only one iterable; the function should return True iff t is a member of this iterable. The code:

# Can we select exactly one member from each set s.t. the sum of all selected members is t?
def can_make_sum_p(t, sets):
    if (len(sets) == 1): return (t in sets[0])
    head = sets[0]; tail = sets[1:]
    return any(can_make_sum_p(t-e, tail) for e in head if e <= t)

With this single change, timeit now reports a per-loop time of 0.32s, and the hotshot profiler reports these results:

         94797 function calls (35465 primitive calls) in 1377.959 CPU seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
21749/4177  282.142    0.013  456.155    0.109 neknek_1.py:15(can_make_sum_p)
       44  179.129    4.071 1354.233   30.778 neknek_1.py:112(constrain)
24660/11003  174.013    0.007  340.939    0.031 neknek_1.py:18()
14592/1357  160.274    0.011  308.033    0.227 neknek_1.py:21(can_make_product_p)
18409/3670  147.759    0.008  271.829    0.074 neknek_1.py:24()
      635  121.681    0.192 1072.522    1.689 neknek_1.py:44(apply)
     1039  102.325    0.098  102.325    0.098 neknek_1.py:78(apply)
     2424   57.304    0.024  897.517    0.370 neknek_1.py:48()
     2489   53.324    0.021   53.324    0.021 neknek_1.py:45()
     4192   49.542    0.012  505.696    0.121 neknek_1.py:56(_test_component)
     1422   18.537    0.013  326.570    0.230 neknek_1.py:67(_test_component)
     44/1    7.163    0.163 1072.548 1072.548 neknek_1.py:132(search)
     1213    6.361    0.005    6.361    0.005 neknek_1.py:137()
      947    5.310    0.006    5.310    0.006 neknek_1.py:63(_test_component)
      256    2.637    0.010    2.637    0.010 neknek_1.py:74(_test_component)
     66/2    2.626    0.040 1071.892  535.946 neknek_1.py:139()
        1    2.513    2.513 1376.102 1376.102 neknek_1.py:100(solve)
        1    1.228    1.228    1.761    1.761 neknek_1.py:91(__init__)

	... clipped ...

As you can see, changing the base case results in an ~84% reduction in the calls to can_make_sum_p(), and an ~29% reduction in execution time.

Coming Attractions

Tomorrow we'll explore further optimizations, using the same tools.

Share and Enjoy:

Preliminaries

Timeit

Hotshot

Initial Observations

Base Case

Coming Attractions

Services

Find Stuff

Pages

Buy My Apps

Other Stuff I’ve Built

Book Club

Archives

Categories

Blogroll