Tuesday, January 15, 2013

Creating an email with a subject line in Hebrew

There ain't no book you can read
There ain't nobody to tell you
But I don't think I'm getting
What everybody's getting
Maybe I'm doing it wrong
 (Randy Newman, "Maybe I'm doing it wrong")

After having successfully created emails which display embedded pictures, we then discovered that the subject line of those emails was not always displayed. The test emails which I sent to myself displayed correctly (naturally), but the subject was not readable (only question marks) when the OP accessed the mail on her iPhone.

After rooting around on the Internet, I discovered that the subject line can have its own encoding. Whatever the encoding, the line itself cannot contain characters whose value is over 127. It turns out that most email subject lines in Israel are encoded with the Windows-1255 code page, which is not surprising as this is the Windows code page for Hebrew.

For every character set (such as Windows-1255 or ISO-8859-8), there are two possibilities: the line can be 'Q' encoded (a variant of quoted-printable) or it can be 'Base64' encoded. Base64 encoding is a system which translates byte streams (which may contain values over 127) into readable characters (A-Z, a-z, 0-9, + and /, with = used for padding at the end).


The above picture shows the headers in an email which I was sent. The part which interests me at the moment is the second line, Subject: =?windows-1255?B?Uk.......?=. Ignoring the 'Subject:' part, the line consists of three sections: a prefix (=?windows-1255?B?), a suffix (?=) and a payload, which is everything in between.
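To make the three sections concrete, here is a minimal sketch in Python (not the Delphi code used for the mailing program) which splits a single encoded subject into character set, encoding letter and payload. The function name and the sample subject are my own illustration; the sample is the Hebrew word 'shalom', not the subject from the picture.

```python
import base64
import re

# One encoded word looks like: =?charset?encoding?payload?=
ENCODED_WORD = re.compile(r"=\?([^?]+)\?([BbQq])\?([^?]*)\?=")

def split_encoded_word(word):
    """Return (charset, encoding, payload) for a single encoded word."""
    match = ENCODED_WORD.fullmatch(word)
    if match is None:
        raise ValueError("not an encoded word: %r" % word)
    charset, encoding, payload = match.groups()
    return charset.lower(), encoding.upper(), payload

# Hypothetical sample: the word 'shalom' (shin, lamed, vav, final mem)
# stored as Windows-1255 bytes and then Base64 encoded.
shalom = "\u05e9\u05dc\u05d5\u05dd"
sample = ("=?windows-1255?B?"
          + base64.b64encode(shalom.encode("cp1255")).decode("ascii")
          + "?=")

charset, encoding, payload = split_encoded_word(sample)
print(charset, encoding, payload)
```

A real subject line may contain several such words separated by spaces, which this sketch does not handle.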

The prefix tells the email client which type of encoding was used; in this case, it is Base64 encoding of Windows-1255 text. The email client of the person who sent me the email encoded the subject into Base64; my email client reads the prefix and then decodes the subject. It's not clear to me at the moment why the majority of emails use B encoding as opposed to Q encoding.
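What the receiving client does can be sketched with Python's standard email.header module, which reads the prefix and hands back the raw bytes together with the character set name. The helper name decode_subject and the sample subject are assumptions of mine for illustration.

```python
import base64
from email.header import decode_header

def decode_subject(raw_subject):
    """Decode a Subject header value into ordinary text."""
    parts = []
    for data, charset in decode_header(raw_subject):
        if isinstance(data, bytes):
            # Encoded parts arrive as bytes plus the charset from the prefix;
            # unencoded parts may arrive as bytes with no charset.
            parts.append(data.decode(charset or "ascii"))
        else:
            parts.append(data)
    return "".join(parts)

# Hypothetical sample: 'shalom' as Windows-1255 bytes, Base64 encoded.
shalom = "\u05e9\u05dc\u05d5\u05dd"
raw = ("=?windows-1255?B?"
       + base64.b64encode(shalom.encode("cp1255")).decode("ascii")
       + "?=")

subject = decode_subject(raw)
print(subject == shalom)
```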

Having learnt this, I 'instructed' my email program to send letters encoded as Windows-1255-Q; this requires taking each Hebrew letter and writing it as an equals sign followed by its code page value in hex (for example: the letter 'aleph' is represented as character 224 in the code page. 224 is E0 in hex, so the encoding would be =E0). The subjects of letters sent like this were correctly decoded by computers running Outlook, but the iPhone remained stubborn. Presumably the iPhone doesn't know how to decode emails sent in windows-1255 format; emails sent from the iPhone are encoded in UTF-8?B format.
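The Q encoding step can be sketched as follows in Python. This is a simplified illustration under my own assumptions: the function name is hypothetical, only ASCII letters and digits are passed through unchanged, and the length limit on encoded words is ignored. In Q encoding the equals sign comes before the two hex digits.

```python
from email.header import decode_header

def q_encode_subject(text, charset="windows-1255"):
    """Build a Q-encoded subject word from text (simplified sketch)."""
    out = []
    for byte in text.encode(charset):
        if byte < 128 and chr(byte).isalnum():
            out.append(chr(byte))          # plain ASCII letter or digit
        elif byte == 0x20:
            out.append("_")                # a space is written as underscore
        else:
            out.append("=%02X" % byte)     # anything else as =XX in hex
    return "=?%s?Q?%s?=" % (charset, "".join(out))

aleph = "\u05d0"
encoded = q_encode_subject(aleph)
print(encoded)   # =?windows-1255?Q?=E0?=

# Round trip through the standard library decoder as a sanity check.
data, name = decode_header(encoded)[0]
print(data.decode(name) == aleph)
```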

In order to send subjects encoded in UTF-8?B format, I have to convert the subject line into UTF-8 encoding and then encode the result in Base64. After no small amount of fiddling about, I discovered that the Delphi system function AnsiToUTF8 performs the first step correctly (at one stage, I assumed that there was a mistake as the Unicode code point for the letter aleph, 05D0, was not the same as its UTF-8 encoding, the two bytes D7 90; it transpires that by definition they are not the same), but the Base64 encoding was producing incorrect results.
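The two steps look like this in a Python sketch, which relies on the standard library's Base64 routine rather than the Delphi unit I was using; the function name is my own. It also shows why the code point of aleph and its UTF-8 bytes differ.

```python
import base64

def b_encode_subject_utf8(text):
    """Step 1: text to UTF-8 bytes. Step 2: bytes to Base64. Then wrap."""
    utf8_bytes = text.encode("utf-8")
    payload = base64.b64encode(utf8_bytes).decode("ascii")
    return "=?UTF-8?B?" + payload + "?="

aleph = "\u05d0"
print(hex(ord(aleph)))               # 0x5d0, the Unicode code point
print(aleph.encode("utf-8").hex())   # d790, the two UTF-8 bytes
print(b_encode_subject_utf8(aleph))  # =?UTF-8?B?15A=?=
```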


After pulling out what is left of my hair and ruining the keyboard with a multitude of Google queries, I discovered that there was a bug in the Base64 implementation that I was using, which came from this page (which is the same as this page). The 'codes64' string begins with the digits, then capital letters, then lower case letters and then punctuation. But this page shows that the capital letters should come first, followed by the lower case letters, followed by the digits, followed by the punctuation. Once I had made this correction, the strings began to be encoded 'correctly' - I could read them properly in my email client, which I am taking to be the definition of 'correct'.
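The effect of the alphabet order can be demonstrated with a small hand-rolled encoder in Python. This is an illustration only, not the Delphi routine in question; the DIGITS_FIRST string is my reconstruction of the ordering described above, and the exact position of its punctuation is an assumption.

```python
import base64

# Standard order: capitals, lower case, digits, punctuation.
STANDARD = ("ABCDEFGHIJKLMNOPQRSTUVWXYZ"
            "abcdefghijklmnopqrstuvwxyz"
            "0123456789+/")

# Assumed reconstruction of the faulty order: digits first.
DIGITS_FIRST = ("0123456789"
                "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
                "abcdefghijklmnopqrstuvwxyz+/")

def encode_base64(data, alphabet=STANDARD):
    """Encode bytes using the given 64-character alphabet."""
    out = []
    for i in range(0, len(data), 3):
        chunk = data[i:i + 3]
        # Pack up to three bytes into 24 bits, zero-filled on the right.
        n = int.from_bytes(chunk + b"\x00" * (3 - len(chunk)), "big")
        chars = [alphabet[(n >> shift) & 0x3F] for shift in (18, 12, 6, 0)]
        keep = len(chunk) + 1              # characters that carry data
        out.append("".join(chars[:keep]) + "=" * (4 - keep))
    return "".join(out)

data = "\u05d0".encode("utf-8")            # aleph as UTF-8 bytes
good = encode_base64(data)
bad = encode_base64(data, DIGITS_FIRST)
print(good == base64.b64encode(data).decode("ascii"))  # True
print(good != bad)                                      # True
```

Both alphabets produce strings that look like plausible Base64, which is why the fault is hard to spot by eye; only a decoder using the standard order reveals it.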

Why the epigraph from Randy Newman which opens this posting? Because it seems that the information which I needed does not exist in one place; I had to ferret it out piece by piece and deal with misinformation. At the beginning, almost certainly I was doing it wrong; there was no book to read and no one to tell me. Obviously I'm not getting what every other email client is getting, so clearly I'm doing it wrong.

So: we have slain yet another problem in the path to sending Hebrew emails to a mailing list. Unfortunately, there is still another problem waiting in the wings....
