Thursday, April 11, 2013

Speeding up Excel

There is a module in the Occupational Psychologist's management program which lists all the receipts that they have issued. At the moment, there are just over 4,000 lines in the table (and nine columns) - it takes a while for all that data to be returned from the database and displayed on the screen. The other day, the OP asked me to add the possibility of outputting the data to Excel.

Whilst I demurred at the need for such a report, it's not my money paying the programmer's cheque, so I quickly wrote the necessary commands using standard Excel automation code. The  first few program runs took so long that I added a progress bar to the form, which enabled me to see that the program was indeed running. I should note that updating a progress bar 4,000 times also takes time, so I changed the code to update the bar once every 32 records.

I started thinking about how I could improve this. My first idea, interestingly, was to put the Excel code into a separate thread, having learnt how to do this with the emails. Whilst this worked, I found that the program was unresponsive, even though the heavy code was in a separate thread. I read a few articles about this on the Internet, which suggested that it was not possible to put Excel in a separate thread. I also had my doubts about the output.

Casting around for new ideas, I discovered code which output files in Excel's native file format. What could be better than this, I wondered. After playing around a bit, I got the code to work on my computer and shortly I was outputting xls files! At first, I continued with the progress bar, but it was clear that this new version was performing much better, so I removed the bar. 

So pleased was I that I considered replacing all the Excel code in the management program with this new library, but a cooler head prevailed. Let me check first that this code works on the OP's computer: even though I checked it with Excel 2003 and Excel 2007, she has Excel 2013 and who knows what surprises await there. Indeed, when I ran the program on her computer, I received an error message stating that the file had been blocked. I tried setting the trust values in her program, but the hard-to-understand Microsoft Hebrew and the fact that I was running her computer by slow remote control defeated me.

I then checked a few facts and discovered that I had been creating what is called BIFF-5 code, which Excel 2013 disdained to read. BIFF-5 was the primary format for Excel 5 (many years extinct); more modern versions prefer BIFF-8 code whereas Excel 2013 prefers BIFF-12 (although it can read BIFF-8). I searched but was unable to find an implementation of BIFF-8 (let alone BIFF-12) in Delphi. I was rapidly losing hope until I remembered something that I had written in a blog entry of mine from three and a half years ago (amazing how much we can forget):

Today I was reading another of Joel Spolsky's blogs, this time on Office formats. A comment at the end hinted at a new solution: Use a simpler format for writing files. If you merely have to produce Office documents programmatically, there’s almost always a better format than the Office binary formats that you can use which Word and Excel will open happily, without missing a beat.

In the same way that I spent the weekend outputting HTML in order to import it into Word, I could output a file in CSV format (as opposed to BIFF-5) and import that into Excel. I wouldn't need any fancy libraries and could code this very easily. So I did. I also checked that it worked properly with Excel 2013.

I thought it best to take the scientific approach and measure how long it took to activate Excel and pass it the data (either by automation or by file import, including the time needed to create the file). I was not surprised to find that the original automation method required on average 44,420 ticks (about 44 seconds) to handle the 4,000 lines (I thought it took longer). The BIFF-5 code on my computer took a mere 5,300 ticks - 8 times faster! 

So how fast could the csv code be? I couldn't believe my eyes when my measurements showed a mere 530 ticks! I checked that the resulting spreadsheet was correct - it wasn't, I had outputted the file with the 'xls' extension instead of 'csv' so Excel had put everything into one column. That mistake was quickly corrected; when I ran the program again, 563 ticks were required for the correct output. I ran the program four more times with an average execution time of 500 ticks. In other words, this csv code ran 111 times faster than the original code and nearly 11 times faster than the supposedly fast BIFF-5 code!!

Most the Excel code which I write simply outputs values into spreadsheets; after the data has got into the spreadsheet, I set column formats via automation, something which can't be transferred in a csv file. In other words, 99% of the time is spent simply transferring values into the spreadsheet. There are other programs which create fancy graphs so obviously in those cases there is more automation and less data transfer; I imagine that I wouldn't see a hundred fold speed improvement there.

1 comment:

Louis Kessler said...

You are correct. I found the same thing. It is so much faster to stringlist to csv and then import.