I know that it's very sad, but for the past few weeks, I only seem to have read books about statistics*. Most of that time has been spent reading "Intermediate statistics for dummies" by Deborah Rumsey; this book has introduced me to such concepts as multiple regression analysis, and I was very excited to read about using this technique with forward selection.
Just after having updated my draft research proposal with material on forward selection, I received some pages from my mentor on multiple regression in which he came out strongly against this technique. It was difficult at first to understand what his objection was, but after a few more readings of Rumsey's text, I began to realise that forward selection is maybe not such a good technique and that it would be better to replace it with the 'best subsets' technique. Here is a very good description of the technique.
As I write in my proposal, the final analysis necessary for dis/proving the hypotheses will be performed by multiple regression analysis, using the 'best subsets' method. King (2003, p. 394) writes that "This approach allows the researcher to compare a number of models via summary statistics and then select one or more best sets of variables. Note that a computer program has not selected a model for the researcher". Both forward and backward stepwise methods suffer from the problem that once a variable has been included (or discarded), it cannot appear in any later models.
To quote a sentence from the linked article, "In general, if there are p-1 possible candidate predictors, then there are 2^(p-1) possible regression models containing the predictors. For example, 10 predictors yield 2^10 = 1024 possible regression models. That's a heck of a lot of models to consider! The good news is that statistical software, such as Minitab, does all of the dirty work for us". I won't be using Minitab, but rather the freeware program OpenStats, which fortunately supports the best subsets technique. My research has about 18 different predictors, which could mean that I would need 2^18 = 262,144 models! I strongly suspect that several of the predictors which I have listed in my research have little to no correlation with my dependent variable (EUC practice) and so I can discard these predictors before using the best subsets technique. I expect that about six predictors will take part in the final model, which means that there will be 2^6 = 64 possible models.
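The model counts quoted above can be checked with a few lines of Python. This is only a sketch of the counting argument, not of the regression itself or of anything OpenStats does: each predictor is either in a model or out of it, so k predictors give 2^k subsets. The predictor names x1 to x3 are placeholders for illustration, not the actual research variables.

```python
from itertools import combinations


def count_subsets(n_predictors):
    """Number of candidate models: each predictor is either included or not."""
    return 2 ** n_predictors


def enumerate_subsets(predictors):
    """Yield every subset of the predictors, starting with the empty
    (intercept-only) model and ending with the full model."""
    for size in range(len(predictors) + 1):
        for subset in combinations(predictors, size):
            yield subset


# Hypothetical predictor names, purely for illustration
predictors = ["x1", "x2", "x3"]
models = list(enumerate_subsets(predictors))

print(len(models))        # 8 subsets for 3 predictors
print(count_subsets(10))  # 1024
print(count_subsets(18))  # 262144
print(count_subsets(6))   # 64
```

Screening the predictors from 18 down to about six before running best subsets cuts the number of candidate models by a factor of 2^12 = 4096, which is why the preliminary discarding step matters so much.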
I read the 'dummies' book on the Kindle and once again came to the conclusion that whilst it's fine to read novels with the Kindle, it's very hard to read learning materials. All the many tables which Rumsey provides are mixed up and unreadable, and I find myself constantly flipping between pages, making it very difficult to maintain a level of understanding. So I have ordered a 'real' copy of the book. Once I have it, I will practice entering some of her data sets into OpenStats.
Not content with this, I have also been reading a book called "Naked statistics", which provides a very understandable introduction to statistics. To make things easier, there aren't many numbers in the book (I suspect that many people get put off by the endless tables of numbers, the raw data for statistical analysis), but rather descriptions. This approach makes concepts such as mean, median, standard deviation and hypothesis testing much easier to understand.
I haven't finished the book yet so I don't know whether it scales the lofty heights of multiple regression (I suspect not).
I would definitely recommend both these books to all DBA students, suggesting that first they read 'Naked statistics' and only then start on 'Intermediate statistics'. To quote part of the introduction to 'Naked statistics', "In hindsight, I now recognise that it wasn't the math that bothered me in calculus class; it was that no one ever saw fit to explain the point of it." This book explains very well the 'point' of statistics without going into the maths (at least, not very deeply).
* For light relief, I've been rereading the late Fred Pohl's "Gateway" series in the past few days.