Saturday, October 03, 2015

More statistics functions with SQL

There are  three  commonly  used  measures  of  central  tendency: the  mode, the median and the mean. The mode describes the most common score, the median the score of the middle case, and the mean the average score. The measures available for analysis depend on the type of variable.


 In my data, the 'results' table contains the results (naturally) for a given section (or variable). Since data from nominal and ordinal variables cannot be aggregated (see previous post), each section corresponds to a single question, thus the results (or scores) are actually the options chosen for that question.

In my pilot study, the results for the following ordinal question were
Q45: What is your highest level of education?
1. Ninth grade (O-levels) - 0
2. A levels - 1
3. Professional diploma - 3
4. Bachelor degree - 8
5. Masters degree - 4
6. Doctorate - 0

The mode is clearly 8 - bachelor degree. This can be determined by a fairly easy SQL query:
select score, count (*) from results where section = :p1 group by score order by count (*) desc
where p1 is a parameter whose value is the number of the section being evaluated (for education, it's 12). The query will return several rows, depending on how many options exist for the question, but only the first row (which will be 4, 8) would be used.

If one arranges the values by score, they are 1333444444445555. As there are 16 values, the median lies between the 8th and 9th value, which in both cases is 4. Obtaining the median via SQL without using a cursor is fairly tricky - it involves the use of two less frequently used SQL keywords: first and skip. The query which I came up with is
select first 1 skip 8 score from results where section = :p1 order by score
The number following first defines the number of rows to be returned - in this case, one. The number following skip defines how many rows in the original dataset should be skipped before applying first: translated into English, the query says order the values, skip the first eight values then return the next one. In other words, return the ninth value - the median - when the rows are ordered.

The reason that the '8' is on a separate row is because it cannot be parameterised; it has to be added dynamically into the SQL query during preparation; isolating this value simplifies the addition.

No comments: