Saturday, July 19, 2014

Statistics with SQL (Firebird)

Yesterday I was implementing a fairly advanced statistical analysis ("Testing a hypothesis about two population means when the combined sample size is at least 30", as my doctoral statistics text terms it), during which I realised that I could improve one phase of the calculation by changing my SQL queries. On the basis of that change, I decided to write an entry explaining some simple statistic measures and how to calculate them. I will use two approaches to the calculations, one naive and one (hopefully) sophisticated and measure the difference between the two.

The statistic measures are
  1. The number of data items (n)
  2. The mean (average) of the data items - the sum of the values divided by the number of items (m)
  3. The standard deviation of the data - this is a measure of how dispersed the data are from the mean (sd). This is obtained by initially calculating the difference between a datum and the mean, then squaring this number (as a result, all the values will be positive). These values are summed then divided by the number of data items; the result is called the variance. The standard deviation is the square root of the variance.
In normal distributions (frequently referred to as "the bell chart"), approximately 66% of the data will be between one standard deviation below the mean and one standard deviation above the mean. Approximately 95% of the data will be between two standard deviations below the mean and two standard deviations above the mean.

An everyday example of such a distribution is the IQ test. Unlike real distributions, the mean of the IQ is artificially fixed to be 100; the standard deviation is about 15. Thus 95% of the population have an IQ between 70 and 130.

Enough theory and onto the experiment. I am using as my data set a table which holds the results for all the people who have taken a certain exam. From one exam, several results are obtained, each for a particular scale. To simplify matters, I will take the results from only one scale. Thus the table structure is: pid (personal id), scale and num (test value), where pid and scale comprise the composite primary key.

The naive approach
We start with the simple data query select num from results where scale = 1. This will return one record per person. A simple loop will be sufficient to calculate both n and m:
 count, total: integer;
 mean, diff, sd: double;
 qResults: TSqlQuery;

 count:= 0;
 total:= 0;
 with qResults do
   while not eof do
     inc (count);
     inc (total, fields[0].asinteger);
I am deliberately not closing the query because it will be used again to calculate the standard deviation. At this stage, the variable count is n, and the mean is calculated by dividing total by count. Once the mean is known, then the standard deviation can be calculated in the following manner.
   mean:= total / count;
   diff:= 0.0;
   first; // iterate over the query again
   while not eof do
     diff:= diff + (fields[0].asinteger - mean) 
                 * (fields[0].asinteger - mean);
 sd:= sqrt (diff / mean)
When run against the database, the above code took 562 ticks on average. This can be divided into three sections: opening the query (15), calculating the mean (270) and calculating the standard deviation (277). This time is dependent on the size of the data set.

The sophisticated approach
Instead of using a simple query and calculating all the variables in code, SQL can be used to provide the calculations.

The number of data items can be calculated very simply with this query: select count (*) from results where scale = 1. The mean can be calculated in two ways: either the total is calculated (select sum (num) from results where scale = 1) and then divided by the number of data items, or it can be calculated directly (select avg (num) from results where scale = 1). Whichever approach is used, the two queries can be combined into one - select count (*), sum (num) from results where scale = 1.

Once the mean is known, the standard deviation can be calculated. This again can be achieved by using one query - select sum ((num - mean) * (num - mean)) from results where scale = 1, then dividing this by the number of values and then taking the square root. I would have liked to parameterise this query (ie passing mean as a parameter) in the form select ((num - :p1) * (num - :p1)) ... but unfortunately the syntax of Firebird doesn't seem to allow me to do this. So I have to use an underhand trick and dynamically build the select statement, including the mean as a magic number.

When run against the database, the above code takes 399 ticks on average. The run time of these queries should be much less dependent of the size of the data set. One benefit of running the queries on the SQL server is that there is much less network traffic - in fact, there is basically none in the sophisticated approach.

Following are the complete results from my tests. The timing is not particularly accurate as Windows is a multi-tasking environment and theoretically any event - such as either the anti-virus program or Dropbox waking up - would upset the timing. That's why I ran each test several times. The winner - by a clear margin - is the sophisticated approach, running in about 70% of the time of the naive approach (or 40% faster). I have to admit that this is less of a difference than I had expected.

Naive 1 2 3 4 average
Open query 15 16 15 15 15
calculate mean 282 265 250 282 270
calculate sd 297 281 266 266 277
Total 594 562 531 563 562

Sophisticated 1 2 3 4 average
calculate mean 187 188 188 172 184
calculate sd 219 203 203 234 215
Total 406 391 391 406 399

After I had run my tests, I decided to see what I could find on the Internet about calculating standard deviations. It turns out that Microsoft SQL server has two dedicated functions to do this (STDEV and STDEVP), the difference being that the first is used when calculating on a sample whereas the second is used when calculating on the entire population. There is a slight difference in the formula to calculate the standard deviation of a sample, but when the sample is large (I would say over 100), this difference can be ignored.

Firebird has no internal function to calculate the standard deviation. I read some discussion of trying to do this, but an elemental mistake was made: the code presented sums the difference between the value and mean instead of summing the square of the difference between the value and the mean. Unfortunately I have no idea when the above discussion took place and whether someone posted a reply, pointing out the error.

[SO: 3590; 2, 13, 35]
[MPP: 441; 0, 1, 6]

No comments: