Tuesday, September 27, 2022

More statistical analysis

I wrote a few days ago about performing statistical analysis on a cohort of slightly more than 600 people who have taken the OP's flagship exam twice, concluding with the words [I used] a very painstaking method which is also v-e-r-y slow. One reason for this lack of speed was a query that was opened once for each person and retrieved the score for a given scale from both exams at the same time. I had thought that this would be faster than opening a more naive query twice for each person, retrieving the score for the first exam and then the score for the second. The number of times that the sophisticated query (with a few self joins) gets opened is the number of people times the number of scales measured; the naive query gets opened twice that number of times.
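To make the difference concrete, here is a minimal sketch of the two kinds of query, assuming a hypothetical table results(person_id, exam_no, scale_id, score); neither the table nor the field names come from the original system.

    # Sophisticated: one query per person per scale, using a self join
    # to fetch the scores from both exams at the same time.
    BOTH_EXAMS = """
        SELECT r1.score, r2.score
        FROM results r1
        JOIN results r2 ON r2.person_id = r1.person_id
                       AND r2.scale_id  = r1.scale_id
        WHERE r1.person_id = ? AND r1.scale_id = ?
          AND r1.exam_no = 1 AND r2.exam_no = 2
    """

    # Naive: the same simple query opened twice per person per scale,
    # once with exam_no = 1 and once with exam_no = 2.
    ONE_EXAM = """
        SELECT score FROM results
        WHERE person_id = ? AND scale_id = ? AND exam_no = ?
    """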

At some stage, I decided to allow the user to choose the scale(s) for which the calculations would be performed, instead of iterating over all the scales (about 40). This certainly made debugging easier!

I knew that one of the problems with my standard code for calculating means and standard deviations was that some people have a value for a given scale in the first exam but not in the second, which is why I was using this 'painstaking method'. I then suspected that a sophisticated query (one that included a check that there were values for the same person and the same scale in both exams) might be doing more than it needed to. So I split the work into two: first a query returns everybody's value from the first exam for a given scale (this alone is faster), then for each person the value from the second exam is retrieved; the calculations are based only on those who have a value for the second exam. This greatly improved the time for any given scale, even when everybody who had a value for the first exam also had a value for the second. But this was small change (a double entendre).
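In outline, the split version might look like the following sketch, again using the hypothetical results table; sqlite3 merely stands in for whatever database access layer is actually in use.

    import sqlite3

    conn = sqlite3.connect("exams.db")   # hypothetical database file
    cur = conn.cursor()

    def paired_scores(scale_id):
        # One query returns everybody's first-exam score for the scale...
        cur.execute("SELECT person_id, score FROM results "
                    "WHERE scale_id = ? AND exam_no = 1", (scale_id,))
        pairs = []
        for person_id, score1 in cur.fetchall():
            # ...then the second-exam score is looked up per person;
            # people without a second-exam value are simply skipped.
            cur.execute("SELECT score FROM results WHERE person_id = ? "
                        "AND scale_id = ? AND exam_no = 2",
                        (person_id, scale_id))
            row = cur.fetchone()
            if row is not None:
                pairs.append((score1, row[0]))
        return pairs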

The biggest time saver came from the way that I was accessing the database. Both the z-test (comparison of two means) and correlation are parametric tests, in that they depend on means and standard deviations. So I could calculate these values once for every scale and then use them for both tests, each of which depends on the difference between a given value and the mean (e.g. if the mean score for a given scale is 31.2, then the difference between a person's score and this mean is required). The standard deviation is based on the square of the difference between a score and the mean, whereas correlation depends on the difference in one exam multiplied by the difference in the second exam.
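For the record, here is a textbook formulation in Python of the quantities just described (not the original code): both statistics are built from the same cached differences from the means.

    from math import sqrt

    def z_and_r(pairs):
        # pairs: (first exam score, second exam score) for each person
        # who has a value in both exams.
        n = len(pairs)
        xs = [p[0] for p in pairs]
        ys = [p[1] for p in pairs]
        mean_x = sum(xs) / n
        mean_y = sum(ys) / n
        # Standard deviation: the squared difference from the mean.
        var_x = sum((x - mean_x) ** 2 for x in xs) / (n - 1)
        var_y = sum((y - mean_y) ** 2 for y in ys) / (n - 1)
        # Correlation: the difference in one exam multiplied by the
        # difference in the other.
        cov = sum((x - mean_x) * (y - mean_y) for x, y in pairs) / (n - 1)
        z = (mean_x - mean_y) / sqrt(var_x / n + var_y / n)
        r = cov / sqrt(var_x * var_y)
        return z, r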

My standard SQL code for calculating the mean can do the job in a single query, but I couldn't use that method here. So I was retrieving the score for each person, calculating the mean, then retrieving each score again and subtracting it from the mean in order to calculate the difference. Why was I hitting the database again to retrieve data that I had already retrieved? I wrote some simple code to store the retrieved scores in an array; further manipulations are performed on the array instead of accessing the database a second time. The only problem with this is having to declare the size of the array in advance; I chose 1000 elements, which allows about 40% spare room at the moment.
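The change amounts to something like this sketch (in Python a list grows as needed, so the fixed 1000-element declaration from the original is only mirrored by a comment):

    import sqlite3

    MAX_PEOPLE = 1000   # the fixed array size mentioned above

    conn = sqlite3.connect("exams.db")   # as in the earlier sketch
    cur = conn.cursor()

    def mean_and_diffs(scale_id):
        # Hit the database once and cache every score locally...
        cur.execute("SELECT score FROM results "
                    "WHERE scale_id = ? AND exam_no = 1", (scale_id,))
        scores = [row[0] for row in cur.fetchall()]
        # ...then all further work is done on the cached copy instead
        # of opening the same query a second time.
        mean = sum(scores) / len(scores)
        diffs = [s - mean for s in scores]
        return mean, diffs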

This simple change caused a huge speed-up! I had been using a progress bar to track the retrieval of values; this now worked so fast that the bar didn't have time to update. 0 to 100 in a second! So I dropped the progress bar, as it was only adding overhead.

Now that the calculation for a scale runs very fast, it was time to look at the results. The z-values for the scales were all very low, showing that there was basically no difference between the means of the two exams, or in other words, that the exam itself is valid. But the correlations led to a slightly different conclusion; most had values between 0.60 and 0.75, meaning a strong correlation, but not as strong as the z-values would suggest. One scale had a correlation coefficient of only 0.31, which is considered a weak correlation. Both the means and the standard deviations for this scale were nearly the same, so I asked myself where the difference was coming from.

I looked at the raw data: apart from 3 people whose scores changed greatly between the exams, most people had more or less the same scores. In some cases the score was slightly higher in the first exam, in other cases slightly higher in the second (there were also a few cases with the same score in both exams). This explains why the correlation coefficient was so low: a high coefficient would mean that everybody's score moved in the same direction, but because some 'improved' and some 'regressed', not everybody is in step and so the coefficient is low. As far as I am concerned, the z-test is more appropriate than correlation in this case.
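A tiny made-up example shows how this can happen: the two lists below contain the same set of scores, so the means and standard deviations are identical and z is essentially zero, but individual people move in different directions and the correlation collapses.

    from statistics import mean, stdev, correlation

    # Invented scores for five people; the second exam is a shuffle
    # of the first, so the group statistics are unchanged.
    first  = [30, 32, 28, 35, 25]
    second = [32, 30, 35, 28, 25]

    print(mean(first), mean(second))      # 30 and 30
    print(stdev(first), stdev(second))    # identical spreads
    print(correlation(first, second))     # about 0.09: nearly none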

As I wrote to the OP, it's not as if a group of people took an exam with well-defined answers, got their marked papers back, worked on their wrong answers, then took the exam again. In that case one would expect everybody's score to improve, albeit by different degrees; the correlation coefficient would be a mathematical way of expressing 'by different degrees'. But here there are no correct answers; the exam is rather a measure of how the examinees felt at the time that they took it.

Now that I think about it even more, this does make some sense: the problematic scale supposedly measures people's attitude towards communality. One would expect the attitude towards communality of people who had a 'poor COVID-19 experience' (especially if they were hospitalised) to have worsened when 'after' is compared with 'before', whereas those who had a more positive experience of the pandemic (e.g. myself) might be expected to score higher on communality.
