Sunday, October 25, 2009

The importance of building indexes

Or as I would say, with the benefit of a classical education, the importance of building indices.

I have written a psychological testing application, in which the user is presented with a list of words, and s/he has to choose ten words which very much describe himself, then choose words which partially describe himself, and words which do not describe himself. The application itself works fine, but I was interested in exploring the meta-data possibilities: which words have been most frequently chosen in the first category, and which words have never been chosen in the first category. The first query was not a problem, but the second (which words have never been chosen) leaves me stumped.

The table structure is as follows:
table words: id, name
table choices: pid (person id), wid (word id), class (value between 1-6)

Presumably the answer involves a left join between words and choices, but there has to be a modifying statement - where choices.class = 1 - and this is causing me problems.
I posted this question on the excellent Stack Overflow site, and received some interesting answers. One person suggested the following query:

SELECT name
FROM words
WHERE id NOT IN
(SELECT DISTINCT wid  
FROM choices
WHERE class = 1)

When I ran this query, it took a staggeringly long 36 seconds to produce the answer! I had been using a two query strategy which took maybe one or two seconds, so to receive a pure SQL answer which took much longer seemed beside the point. When I responded, telling how long the query was taking, it was suggested that I add an index on the fields (wid, class) to the choices table.

Once I did this, the execution time dropped to 62 milliseconds. Yes, that's right: 580 times faster!! Later someone suggested this query

SELECT name
FROM words
WHERE NOT EXISTS
(SELECT 1
FROM choices
WHERE choices.class = 1 and choices.wid = words.id)

I don't know how long this would have taken without the index, but with the index it took only 31 milliseconds. When running the program, it seems that the data appears before they have been requested. This query is fascinating in its simplicity and I shall try to remember it. It's actually the same syntax which is used in a report at work to discover parts which don't have a bill of materials.

So: it's very important to index fields which are used in a query. Armed with this knowledge, I checked a few other queries, both in this application and in our flagship app. After adding a few indices (all right, indexes), one query became three times faster and another twice as fast.

Even without the speed-up gained by the index, there is another point to be considered: the algorithm. In one part of the program (effectively the most important part, where the user's results are displayed, so speeding this part up is much more important that improving a non-essential query), I was calling the same query in a loop which ran six times (each loop invocation would call the query with a different parameter). It wasn't too difficult to see that I could remove the loop, call the query once for all the data, and then massage the returned data in code.

My attitude has always been to get the program running first with simple code and correct results; sometimes I forget to revise the program and substitute more complicated code which runs much faster.

No comments: