Saturday, March 15, 2014

Boy, was I wrong - programming naivety

It's conventional wisdom in computer science that the higher-level the programming language, the less efficient the resulting code; this is why programs written in assembly language tend to be smaller and faster than those written in a high level language. But within a high level language, too, there can be low and high level constructs: the low level construct may require more statements from the programmer but will run faster, because a simple high level construct hides a multitude of low level operations from the user.

Let's say that there is a table whose tuples are composed of the following fields:

ID - longint (four bytes)
name - varchar (thirty two bytes)
other data - sixteen bytes
comments - varchar (128 bytes)

Each tuple will thus comprise 180 bytes, assuming that there is no overhead. Let's say that there are one hundred tuples in this table - this is very small, but suitable for a 'master' type table; such a table will probably have more fields but no more tuples. The scenario is that the entire list of tuples is displayed for the user so that she can choose records for updating.

The truly naive programmer will issue a database command - for example, "select * from table" - in order to retrieve all the tuples from the database. Result: 180 * 100 = 18,000 bytes are retrieved from the database; in reality, the amount of data will be less, as the 'comments' field will not be completely full in every tuple. A more realistic figure would be 116 bytes per tuple, or 11,600 bytes for the table. If the database sits on a network, those 11,600 bytes have to travel along the wires. The truly naive programmer will not be aware of - or will ignore - the time necessary for the data to arrive. All he knows is that he can get the data whenever he needs it, using one small but powerful command, 'open'.
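The arithmetic above can be sketched as follows (the field widths are the ones given for the example table; the 64-byte average for a half-full 'comments' field is my assumption, chosen to match the 116-byte figure):

```python
# Per-field widths of the example table, in bytes.
FIELDS = {"id": 4, "name": 32, "other": 16, "comments": 128}
ROWS = 100

full_tuple = sum(FIELDS.values())        # 180 bytes, assuming no overhead
naive_transfer = full_tuple * ROWS       # 18,000 bytes for "select *"

# Assume 'comments' averages about half full (64 bytes), giving the more
# realistic figure quoted in the text.
realistic_tuple = 4 + 32 + 16 + 64       # 116 bytes
realistic_transfer = realistic_tuple * ROWS

print(full_tuple, naive_transfer, realistic_transfer)
```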

At least fifteen years ago I learned a style of programming with databases which is fairly efficient: it separates the displaying of the table from the editing. For the display, I need only the id and name - thirty six bytes per tuple - so I would request only 3,600 bytes from the database: less than a third of the data, or roughly three times quicker. The tuple to be edited is then retrieved in its entirety.
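The saving from the narrower display query works out like this (numbers as above):

```python
ROWS = 100
display_tuple = 4 + 32                 # id + name only: 36 bytes
display_transfer = display_tuple * ROWS  # 3,600 bytes for the display list

realistic_transfer = 116 * ROWS        # 11,600 bytes for the full rows
ratio = realistic_transfer / display_transfer

print(display_transfer, round(ratio, 1))  # roughly a threefold saving
```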

Yesterday, I had an epiphany which shows just how naive I am too (or, as I prefer to call it, "blinkered"). The epiphany connects to the opening paragraph of this blog, and especially to that one small but powerful command, 'open'. When one opens (or reopens) a database query, all the data has to be retrieved from the database.

What if one is deleting a tuple from the table? I would issue a 'delete' statement in SQL, which is as efficient as possible, but then I would issue an 'open' command in order to update the display. Why am I requesting all the data which I already have, when all I need to do is remove the deleted record?

I use the clientdataset component in Delphi, which is extremely powerful - but which, apparently, I only partially understood. This component holds in its buffer all the data retrieved from the database, making various actions, such as on-line sorting, very fast. With regard to the tuple deletion, all I need to do is delete the tuple from the clientdataset - there's no need to retrieve from the database data which I already have in memory!
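Since the original is Delphi, here is only a minimal Python sketch of the idea, not the real TClientDataSet: the buffer is a plain list of dicts, and execute_sql is a hypothetical stand-in for the actual database call.

```python
class LocalBuffer:
    """Toy stand-in for a clientdataset buffer: rows already in memory."""

    def __init__(self, rows):
        self.rows = list(rows)

    def delete(self, row_id, execute_sql):
        # Tell the database once...
        execute_sql("delete from table1 where id = ?", (row_id,))
        # ...then fix the buffer locally instead of reopening the query.
        self.rows = [r for r in self.rows if r["id"] != row_id]

sent = []  # records the SQL that would go over the wire
buf = LocalBuffer([{"id": 1, "name": "a"}, {"id": 2, "name": "b"}])
buf.delete(1, lambda sql, params: sent.append((sql, params)))
print(len(buf.rows), len(sent))  # one row left, one statement sent
```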

The extension of this idea hit me only when I was in the shower prior to going to bed (I do some of my best thinking in the shower): when adding or editing a tuple, there is no need to retrieve all the previously existing records; it's enough to retrieve only that tuple and then add it manually to the clientdataset. True, this will require more work from me than simply writing 'open' - but I only need do this once, and the user will benefit from it every time a record is added to the table.
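The same sketch style applies to adding a record: fetch only the new tuple and merge it into the local buffer. Here fetch_one is a hypothetical single-row query helper, simulated by a lambda.

```python
def add_row(buffer, new_id, fetch_one):
    # One row over the wire, not the whole table.
    row = fetch_one("select * from table1 where id = ?", (new_id,))
    buffer.append(row)
    return row

rows = [{"id": 1, "name": "a"}]  # what the buffer already holds
added = add_row(rows, 2, lambda sql, params: {"id": params[0], "name": "b"})
print(len(rows), added["id"])
```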

The same idea applies to polling - there are applications which I have written that poll (i.e. access) a database table every five or ten minutes, always clearing the data held in the clientdataset's buffer and then retrieving the new data. What if no changes have been made to that table in the intervening five minutes? My original idea was to check the number of records in the table and update only if the number has changed, but now it occurs to me that I need only retrieve records which have been added since the last access and add them manually. This way, I will greatly reduce the amount of network traffic.
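Incremental polling could be sketched like this: remember the highest id already held and ask only for newer rows. fetch_since is a hypothetical query helper, and the sketch assumes ids only ever grow (deletions and edits would need extra handling).

```python
def poll(buffer, fetch_since):
    last_id = max((r["id"] for r in buffer), default=0)
    new_rows = fetch_since("select * from table1 where id > ?", (last_id,))
    buffer.extend(new_rows)          # merge instead of reloading everything
    return len(new_rows)

buffer = [{"id": 1}, {"id": 2}]      # rows held since the last poll
fake_db = {1: {"id": 1}, 2: {"id": 2}, 3: {"id": 3}}
added = poll(buffer, lambda sql, p: [r for i, r in fake_db.items() if i > p[0]])
print(added, len(buffer))
```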

Of course, all of the above is correct only if one person is updating the entire database. If there are three or four users (I am thinking of the occupational psychologist's offices), then all the users may have the same table displayed on their screen; in the current scenario, if two people add (or edit) records, then each will see the changes made by the other user (as the data is always reloaded in its entirety), whereas the users who don't make changes won't see updated data - unless they poll. In Priority - and any industrial-strength database program - there is an unwritten contract that once data is displayed on the screen, the display will not change, even though there may be changes to the underlying data. So, in a sense, I had been spoiling the OP's workers, at the cost of network traffic. That era has now come to an end.
