Tuesday, September 27, 2022

More statistical analysis

I wrote a few days ago about performing statistical analysis on a cohort of slightly more than 600 people who have taken the OP's flagship exam twice, concluding with the words "[I used] a very painstaking method which is also v-e-r-y slow". One reason for this lack of speed was a query that was opened once for each person, that would retrieve the score for a given scale from both exams at the same time. I thought previously that this would be faster than opening a more naive query twice for each person, in which the score for the first exam would be retrieved, followed by the score for the second exam. The number of times that the sophisticated query (with a few self joins) would be opened is the number of people times the number of scales measured. The naive query gets opened twice that number of times.

At some stage, I decided to allow the user to choose for which scale/s the calculations would be performed, instead of iterating over all the scales (about 40). This certainly made the debugging easier!

I knew that one of the problems with my standard code for calculating means and standard deviations is that there are people who have values for the first exam for a given scale but not for the same scale in the second exam, which is why I was using this 'painstaking method'. I then suspected that a sophisticated query (that included a check that there were values for the same person for the same scale in both exams) might be doing more than it needs to be doing. So I split the work into two: first there is a query that returns for everybody the value from the first exam for a given scale (this is already faster), then for each person the value from the second exam is retrieved. The calculations are based only on those who have a value for the second exam. This made a great improvement in the time for any given scale, even when everybody who had a value for the first exam had a value for the second. But this was small change (a double entendre).
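The split described above can be sketched roughly as follows, in Python with an in-memory SQLite database. This is only an illustration: the table `results` and its columns (`person_id`, `exam_no`, `scale_id`, `score`) are hypothetical names, not the real schema, and the real program may well be organised differently.

```python
import sqlite3

def paired_scores(conn, scale_id):
    """Return (first, second) score pairs for people who have both values."""
    # One query returns the first-exam value for everybody.
    first = conn.execute(
        "SELECT person_id, score FROM results "
        "WHERE exam_no = 1 AND scale_id = ? ORDER BY person_id",
        (scale_id,),
    ).fetchall()
    pairs = []
    for person_id, score1 in first:
        # Then the second-exam value is looked up per person.
        row = conn.execute(
            "SELECT score FROM results "
            "WHERE exam_no = 2 AND scale_id = ? AND person_id = ?",
            (scale_id, person_id),
        ).fetchone()
        if row is not None:  # people without a second value are left out
            pairs.append((score1, row[0]))
    return pairs

def demo_connection():
    """Build a tiny hypothetical data set: person 2 has no second value."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE results "
        "(person_id INTEGER, exam_no INTEGER, scale_id INTEGER, score REAL)"
    )
    conn.executemany(
        "INSERT INTO results VALUES (?, ?, ?, ?)",
        [
            (1, 1, 7, 30.0), (1, 2, 7, 32.0),
            (2, 1, 7, 28.0),
            (3, 1, 7, 35.0), (3, 2, 7, 33.0),
        ],
    )
    return conn

pairs = paired_scores(demo_connection(), 7)
```

Here `pairs` contains only persons 1 and 3, so any means and standard deviations calculated from it are based on the same people in both exams.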

The biggest time saver came from the way that I was accessing the database. Both the z-test (comparison of two means) and correlation are parametric tests, in that they depend on means and standard deviations. So I could calculate the values once for every scale then use them for these tests, both of which are dependent on the difference between a given value and the mean (e.g. the mean score for a given scale is 31.2; the difference between a person's score and this mean is required). The standard deviation is based on the square of the difference between a score and the mean, whereas correlation is dependent on the difference in one test multiplied by the difference in the second test.
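To make the shared arithmetic concrete, here is a small Python sketch that derives both statistics from the same differences. It assumes the usual two-sample form of the z-test and the Pearson correlation coefficient, with sample (n-1) variances; the exact formulas used in the real program aren't given here, so treat these as assumptions.

```python
import math

def z_and_r(pairs):
    """Compute a z-value and a Pearson correlation from (first, second) pairs."""
    n = len(pairs)
    xs = [p[0] for p in pairs]
    ys = [p[1] for p in pairs]
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # The differences from the mean are calculated once and reused.
    dx = [x - mean_x for x in xs]
    dy = [y - mean_y for y in ys]
    ss_x = sum(d * d for d in dx)               # squares: standard deviation
    ss_y = sum(d * d for d in dy)
    sp_xy = sum(a * b for a, b in zip(dx, dy))  # products: correlation
    var_x = ss_x / (n - 1)
    var_y = ss_y / (n - 1)
    z = (mean_x - mean_y) / math.sqrt(var_x / n + var_y / n)
    r = sp_xy / math.sqrt(ss_x * ss_y)
    return z, r

z, r = z_and_r([(1, 2), (2, 4), (3, 6)])
```

In this toy example the second score is always double the first, so the scores move perfectly in step and the correlation is 1, even though the means differ.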

My standard SQL code calculates the mean in a single query, but I couldn't use this method. So I was retrieving the score for each person, calculating the mean, then retrieving the score again and subtracting this from the mean in order to calculate the difference. Why was I hitting the database again to retrieve the data that I had already retrieved? I wrote some simple code to store the retrieved score in an array; further manipulations would be performed on the array instead of accessing the database a second time. The only problem with this is declaring in advance the size of the array; I chose 1000 elements, allowing 40% spare room at the moment.
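The idea of reading once and then working in memory can be sketched like this in Python. The `CountingSource` class is a hypothetical stand-in for the database that simply counts how often it is read; it is not part of the real program.

```python
class CountingSource:
    """Hypothetical stand-in for the database; counts the number of reads."""

    def __init__(self, scores):
        self._scores = list(scores)
        self.reads = 0

    def fetch_all(self):
        self.reads += 1
        return list(self._scores)

def mean_and_sd(source):
    """Calculate the mean and sample standard deviation with a single read."""
    scores = source.fetch_all()  # the only trip to the 'database'
    n = len(scores)
    mean = sum(scores) / n
    # The second pass works on the in-memory copy, not on the database.
    ss = sum((s - mean) ** 2 for s in scores)
    sd = (ss / (n - 1)) ** 0.5
    return mean, sd

source = CountingSource([2, 4, 4, 4, 5, 5, 7, 9])
mean, sd = mean_and_sd(source)
```

A Python list grows as needed, so the question of declaring the size in advance doesn't arise in this sketch; with a fixed-size array, the size has to be chosen up front as described above.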

This simple change caused a huge speed-up! I had been using a progress bar to show the progress of retrieving values; this now worked so fast that the progress bar didn't have time to update. 0 to 100 in a second! So I dropped this progress bar as it was only adding overhead.

Now that the calculation for a scale works very fast, it was time to look at the results. The z-values for the scales were all very low, showing that there was basically no difference in the means of the two exams, or in other words, the exam itself was valid. But the correlations led to a slightly different conclusion; most had values between 0.60 and 0.75, meaning a strong correlation, but not as strong as the z-values would suggest. One scale had a correlation factor of only 0.31, which is considered a weak correlation. Both the means and standard deviations for this scale were nearly the same, so I asked myself where the difference was coming from.

I looked at the raw data: apart from 3 people whose score changed greatly between the exams, most people had near enough the same scores. In some cases, the score was slightly higher in the first exam and in other cases, it was slightly higher in the second exam (there were also a few cases with the same score in both exams). This explains why the correlation factor was so low: a high factor would mean that everybody's score improved, but because some 'improved' and some 'regressed', not everybody is in step and so the factor is low. As far as I am concerned, the z-test is more appropriate than correlation in this case.
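A toy example in Python shows the effect. The numbers are invented purely for illustration and are not the real data: in the 'mixed' case some scores go up a little and some go down, the means are identical, yet the correlation is much lower than when everybody moves by the same amount.

```python
def pearson(xs, ys):
    """Pearson correlation computed from differences from the means."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    dx = [x - mean_x for x in xs]
    dy = [y - mean_y for y in ys]
    sp = sum(a * b for a, b in zip(dx, dy))
    return sp / (sum(a * a for a in dx) * sum(b * b for b in dy)) ** 0.5

first = [30, 31, 32, 33, 34]
mixed = [32, 29, 33, 31, 35]        # some up, some down; same mean as 'first'
shifted = [s + 2 for s in first]    # everybody up by the same amount

r_mixed = pearson(first, mixed)
r_shifted = pearson(first, shifted)
same_mean = sum(first) / len(first) == sum(mixed) / len(mixed)
```

With these invented scores, `same_mean` is true and `r_mixed` is well below `r_shifted`: a comparison of means sees no difference, whereas the correlation reflects that people are not moving in step.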

As I wrote to the OP, it's not as if there were a group of people who took an exam with well-defined answers, got their marked paper back, worked on their wrong answers, then took the exam again. One would expect everybody's score to improve, albeit by different degrees; the correlation factor would be a mathematical way of expressing 'by different degrees'. But here there are no correct answers but rather a measure of how they feel at the time that they took the exam. 

Now that I think about it even more, this does make some sense: the problematic scale supposedly measures people's attitude towards communality. One would expect that the attitude to communality of people who had a 'poor COVID-19 experience' (especially if they were hospitalised) would have worsened when 'after' is compared to 'before', whereas the attitude of those who had a more positive experience of the pandemic (e.g. myself) might be expected to improve.

Sunday, September 25, 2022

Two films

I watched two films on TV today; in both cases, I saw about 75% of each film, missing the beginning. There were similarities between the films: both were set in Seattle, one completely and one partially. Both featured widowed fathers of 10-year-old boys; both had love interests who were involved to one extent or another with another man.

One film probably had a low budget; no actor was familiar to me. The other film probably had quite a high budget, with a name director and six name actors, all well known. One film had an almost totally believable story whereas the second had a totally unbelievable story. One film probably had low pretensions and came off quite well, whereas the other had high pretensions and I don't understand how people can rate this film so well.

One was 'Sleepless in Seattle'; the other was 'Hint of love'. One film you've all probably heard of, the other is completely unknown. I leave you to draw your own conclusions about which film was more successful.

Saturday, September 24, 2022

New Year greetings

As I wrote six years ago, by chance, it happens that the evening of the new Jewish year [will] fall on Sunday night; as a result, most people have a very long weekend - from Friday until Tuesday inclusive (or if you prefer, my last working day was on Thursday 22/9 and my next working day will be on Wednesday 28/9).

I've got a few bits and pieces to do for the daily job and for my consultancies; I also want to devote several hours to my doctoral thesis, and as a result, I hope that I don't have too many idle hours.

Yesterday and today I worked several hours for the OP; yesterday I had some more clever ideas for the 'ERP' program, and today I worked on a statistical analysis program, calculating both z-values and correlations for people who have taken our flagship exam twice. Is there much difference between the two exams? The z-values don't show anything statistically interesting, but the correlations for some scales aren't very high. Due to the fact that some people took the exam twice but don't have values for all the scales both times (because we added some scales at a later date), I couldn't use my usual code for calculating means and standard deviations. I had to use a very painstaking method which is also v-e-r-y slow.

Monday, September 19, 2022

I held an interview!

I see that I last wrote about my doctoral research in January, a long, long time ago. I had just come to the realisation "that I had a huge enhancement taking place underneath my nose" and so made structural changes to the thesis in preparation for researching this enhancement (the warehouse management system, aka WMS). I didn't realise at the time how long it would take for the WMS to reach a stage where I could interview people about it.

It took a few months for us to upgrade our ERP system successfully (upgrading the standard system wasn't a problem, but we have a few external programs that are essential to how we work and these had to be modified to work with the new version); then we discovered that the database manager (i.e. SQL Server) had to be updated; then we had problems with the wireless network and the program that prints labels .....

I won't go into all the problems that we faced, but basically we only started working with the system in the middle of August, which was at least three months later than I had expected. I had been looking at the calendar with a worried expression on my face, as I have until the end of December to complete the research, write it up, reach conclusions and then finally submit the thesis.

About a week ago, I could see that the system had reached a state suitable for people to talk about it, even if it isn't 100% finished, and so I sent out letters to the four people that I intend to interview. Three answered affirmatively, although two have yet to agree on a date. The fourth person, the man who is running the WMS on the factory floor, hasn't answered yet, which is a shame, as I am very interested to hear what he has to say.

Today I interviewed the factory manager, who normally strikes me as distant and apathetic, but the interview was actually very good, and lasted for nearly 30 minutes. After I had asked all my questions, one more interesting question occurred to me: did the manager think that the long delay and slow pace of implementing the system was a help or a hindrance? I, for example, think that my slow pace in recording songs is actually beneficial as I get the time to consider and reflect on the songs before they are finished*. We agreed that the long time was beneficial, although it could have been shortened somewhat.

After completing the interview and finding the recording via Teams, I realised that there was no way that I was going to transcribe the interview. The pilot study from last year found that it took me about ten minutes to transcribe one minute of interview, so I was looking at nearly five hours' work. I then started looking for the company that had transcribed some of the pilot interviews; at first I couldn't find anything (it didn't help that I started by looking in the wrong place), but eventually I found the receipt that I had scanned and from there got back in touch with the company and uploaded the file.

I wonder how long it's going to take for me to get the transcript returned: next week is the Jewish New Year which means several days off work. Then it will be Yom Kippur and following that another week or more off work. This is a double-edged sword: whilst I have time off work, the transcribers also have time off work which might well delay the return of the transcripts. I had looked forward to using the time on my hands to advance the thesis greatly; I suppose that I can listen to the interview(s) without having the transcript(s) and work on that basis.


(*) I have been working on the music for a new song over the past few weeks. No words, of course, and no idea about what the song is going to be about. Every couple of days I think that I have completed the music, but then I listen to it again and have some more ideas. So I definitely need time for contemplation and reflection on music.

Sunday, September 11, 2022

The Ink Black Heart/2

Rereading the book slowly is interesting, especially the online conversations between the moderators (one has to read the book to know what I'm writing about). I picked up several hints that I missed the first time around, because I had no idea what they were talking about.

And in the course of doing so, I've discovered a major mistake by the author in chapter 23, which consists entirely of online conversations. A very important turning point in the book (almost at the end) is that two characters never take part in the same conversation. Well, I'm sorry to say that those two characters do take part in the same conversation at least once. Oops.

On the other hand, Robin and Cormoran don't see these conversations .....

Tuesday, September 06, 2022

The Ink Black Heart

At the beginning of August, I received an email stating that the sixth book in the Strike series would be published at the end of the month (i.e. a few days ago). I wouldn't say that the anticipation was huge and that I couldn't fall asleep at night wondering about the book, but I did reread a few of the earlier books in the series and waited patiently until the end of the month.

I've just spent a few days slogging through the new book, which is extremely long, apparently more than 1,000 pages in its printed version. Although the previous book, "Troubled blood", was also long, it held my attention, whereas I had problems giving sufficient attention to TIBH. Woven throughout the book are transcriptions of online chats, and whilst most of these are important to the story, there are too many of them and they are too confusing (especially at the beginning). The language - and the abbreviations - used gave me problems. Maybe this is how the youth speak (or more accurately, write) these days. There were several terms/abbreviations that I had to look up on the Internet in order to understand them.

What makes the book long is the multiple storylines: apart from the online chats and the actual story line (who killed the creator of the TIBH online cartoon), the book also tells about other cases that the agency is handling, about Strike's love-life (in every book he seems to get a new love - or at least, sex - interest), the never-ending will they/won't they Robin/Strike dynamic, along with bits and pieces from Robin's own life (now that she's divorced, other possibilities arise). Some of this, especially the bits connected with Strike's old flame, Charlotte, could easily have been removed.

It wasn't until the book was about half way over that it began to be interesting. I will, of course, reread the book, hopefully getting more out of it the second time.

Another take on the book can be found here: whilst I agree with most of the review, I don't particularly agree with the paragraph that begins "And there is a problem with Cormoran Strike himself. He’s rude, violent and doesn’t understand women". He certainly isn't ruder than any of the younger participants, I don't remember him being violent (or at least, not unnecessarily so) and I think that he is more misunderstood by women than he misunderstands them. [I should point out that this was the first review that I came upon, so it may not be representative of other reviews. I'm not trying to make a point on the basis of this review.]

One slightly strange (to me) aspect of the book is that it is set in 2015 and so continues the time-line from the earlier books. Was YouTube such a big thing in 2015? Were the abbreviations current? The setting seems more modern than this, although I appreciate that one can't have a five year gap in the story lines of two consecutive books without something to explain what happened in those five years. Actually, unlike the other books, there are very few references to the detective agency's previous cases, as opposed to events in the protagonists' private lives. But Robin turns 30 (and Strike 40) in the book, whereas a previous book told about her 29th birthday.

Monday, September 05, 2022

Hard boiled eggs

My wife likes to eat hard boiled eggs (*). Instead of cooking them in a small saucepan with a lid, she cooks them in what's known here as a finjan, a small open pot that is normally used for making coffee in the oriental style. I pointed out to her that a great deal of heat is lost when the water boils, as it escapes as steam; it would be more efficient to put a lid on the pot to catch the steam.

But the finjan doesn't have a lid, so I would often balance a small plate on top of the finjan. Recently I did this, but the plate didn't balance and the pot fell to the floor, splashing me with boiling water. I treated the scald immediately (it hurt for about an hour), but I have been left with a red patch above my right knee that one day might fade away.

As a result of this misadventure, I resolved to find a better way of boiling eggs. I suggested the saucepan route, but it seems that we don't have a small enough saucepan; also, I believe that the boiling water causes cavitation that can break an egg's shell. I went looking on the internet for an alternative and found the device pictured above (it was on the website of an Israeli (?) gadgets shop; my wife wanted something from there, and I continued looking to see what else they might have).

I ordered the gadget (about 150 NIS, which these days is equivalent to maybe $40) and it arrived after two weeks. Unfortunately there was no explanation whatsoever as to how to use it (excluding the Mandarin written on the box). Viewing this as an intelligence test, instead of puzzling out how to use the gadget (and in fact, how to put it together), I took the alternative route and looked on the internet for a manual. I got many hits for the gadget but far fewer for instructions; eventually I found some and downloaded them. These aren't 100% accurate for my gadget (they seem to be for a slightly different model) but good enough for my purposes.

First one removes the clear perspex dome and the white egg caddy (pink in the picture). One fills the tiny cup that comes with the gadget with water (this is maybe one tablespoonful, i.e. 15 ml) then pours the water onto the steel plate that is revealed under the caddy. One then replaces the caddy, places however many eggs are required in it (I think that six can be done at once) then replaces the dome. Finally one turns on the electricity.

There is a heater under the steel plate: this causes the water to turn to steam, thus cooking the eggs. Indeed, after about ten seconds the plastic dome was covered inside with condensing steam. After about 7 minutes, the gadget turned itself off - there is presumably some device like a thermostat that recognises when all the water has been turned to steam. I unplugged the gadget from the wall socket but let the eggs sit there for a few more minutes as they continued to cook.

Despite the fact that there is no boiling water to create air bubbles that might break an egg's shell, one of the eggs that I cooked this morning had the tell-tale 'albumen bubble', presumably caused by the shell cracking slightly and the white of the egg leaking slightly.

This gadget seems to be far safer and more efficient than boiling water on a stove.


* I used to like eating hard boiled eggs but now they seem to have a metallic taste. In the summer of 1972 and 1973/4, during the many field trips that I took in Israel, we used to have boiled eggs every day for breakfast. Someone would get a fire going (or we had gas rings), then we would fill a big pot (we used to call them 'tillys') with water and place eggs inside to be cooked. I had an unofficial competition going with our leader in 1972 as to who could peel an egg with the fewest pieces of shell. After some practice, I was able to peel an egg with only two pieces of shell, which I thought was the fewest number possible, but one day I peeled an egg leaving the shell intact! This was possible because directly underneath the shell there is a membrane to which the shell sticks; if one doesn't puncture the membrane, one can peel the egg leaving all the shell stuck to the membrane.


In 1979, whilst working in the kitchen in Mishmar David, one of my fellow cooks would make hard boiled eggs for breakfast. For this we used what is apparently called a 'steam kettle', similar to the one shown in the picture on the left. Every kibbutz kitchen had several of these in different sizes - on Mishmar David we had two really big ones, three smaller ones and one really small. We would put the eggs in a wire basket then put the basket in the smallest kettle in order to cook. The eggs that my fellow cook would make were far from hard boiled - they were so soft that I used to say that she showed the kettle to the eggs, then took them out to the dining room to be served.