INTRODUCTION
Encoding is transparent to most users.
Since the arrival of the ingenious UTF-8 Unicode format, it has become so transparent that even a developer can feel quite lost when an incompatibility occurs.
SUMMARY
1. What is encoding?
⇒ Explanation of encodings with a concrete example
2. UTF-8 format
⇒ Focus on the Unicode UTF-8 format that theoretically eliminates any encoding issue
3. Persistent encoding issues
⇒ Why do encoding issues remain relevant despite the Unicode format?
4. Determine the encoding of a file
⇒ Standard tools to easily determine the encoding of a text
5. Advanced VBA exercise
⇒ A way to create Unicode files in VBA with or without a BOM
Feel free to leave a comment to request support on a particular encoding problem.
1. WHAT IS ENCODING?
A string is not stored in memory as a string but as a sequence of 0s and 1s, in binary.
The most readable representation of this binary data is hexadecimal, where each byte is written as two hexadecimal digits. In ASCII or extended ASCII, each byte represents a single character.
Example:
The following string is encoded with the “Windows-1252” encoding:
"L'expérience est le nom que chacun donne à ses erreurs." Oscar Wilde
In hexadecimal, it is represented as shown below:
22 4C 27 65 78 70 E9 72 69 65 6E 63 65 20 65 73 | "L'expérience es |
74 20 6C 65 20 6E 6F 6D 20 71 75 65 20 63 68 61 | t le nom que cha |
63 75 6E 20 64 6F 6E 6E 65 20 E0 20 73 65 73 20 | cun donne à ses |
65 72 72 65 75 72 73 2E 22 20 4F 73 63 61 72 20 | erreurs." Oscar |
57 69 6C 64 65 | Wilde |
The 7th character, “é”, is stored in memory as the hexadecimal value “E9”.
In the Windows-1252 character table, the code “E9” corresponds to the French character “é”:
Windows-1252 (CP1252)
   | x0 | x1 | x2 | x3 | x4 | x5 | x6 | x7 | x8 | x9 | xA | xB | xC | xD | xE | xF |
0x | NUL | SOH | STX | ETX | EOT | ENQ | ACK | BEL | BS | HT | LF | VT | FF | CR | SO | SI |
1x | DLE | DC1 | DC2 | DC3 | DC4 | NAK | SYN | ETB | CAN | EM | SUB | ESC | FS | GS | RS | US |
2x | SP | ! | " | # | $ | % | & | ' | ( | ) | * | + | , | - | . | / |
3x | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | : | ; | < | = | > | ? |
4x | @ | A | B | C | D | E | F | G | H | I | J | K | L | M | N | O |
5x | P | Q | R | S | T | U | V | W | X | Y | Z | [ | \ | ] | ^ | _ |
6x | ` | a | b | c | d | e | f | g | h | i | j | k | l | m | n | o |
7x | p | q | r | s | t | u | v | w | x | y | z | { | | | } | ~ | DEL |
8x | € | | ‚ | ƒ | „ | … | † | ‡ | ˆ | ‰ | Š | ‹ | Œ | | Ž | |
9x | | ‘ | ’ | “ | ” | • | – | — | ˜ | ™ | š | › | œ | | ž | Ÿ |
Ax | NBSP | ¡ | ¢ | £ | ¤ | ¥ | ¦ | § | ¨ | © | ª | « | ¬ | SHY | ® | ¯ |
Bx | ° | ± | ² | ³ | ´ | µ | ¶ | · | ¸ | ¹ | º | » | ¼ | ½ | ¾ | ¿ |
Cx | À | Á | Â | Ã | Ä | Å | Æ | Ç | È | É | Ê | Ë | Ì | Í | Î | Ï |
Dx | Ð | Ñ | Ò | Ó | Ô | Õ | Ö | × | Ø | Ù | Ú | Û | Ü | Ý | Þ | ß |
Ex | à | á | â | ã | ä | å | æ | ç | è | é | ê | ë | ì | í | î | ï |
Fx | ð | ñ | ò | ó | ô | õ | ö | ÷ | ø | ù | ú | û | ü | ý | þ | ÿ |
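The lookup described above can be reproduced in a few lines. The following is a minimal Python sketch (Python is our choice for illustration, it is not part of the original toolset) that takes the hexadecimal dump of the quotation and decodes it with the Windows-1252 table. The names HEX_DUMP, raw and text_1252 are ours.

```python
# Hexadecimal values copied from the dump of the Oscar Wilde quotation.
HEX_DUMP = (
    "22 4C 27 65 78 70 E9 72 69 65 6E 63 65 20 65 73 "
    "74 20 6C 65 20 6E 6F 6D 20 71 75 65 20 63 68 61 "
    "63 75 6E 20 64 6F 6E 6E 65 20 E0 20 73 65 73 20 "
    "65 72 72 65 75 72 73 2E 22 20 4F 73 63 61 72 20 "
    "57 69 6C 64 65"
)

# bytes.fromhex ignores the spaces between the byte pairs.
raw = bytes.fromhex(HEX_DUMP)

# Decoding = looking each byte up in the Windows-1252 table.
text_1252 = raw.decode("cp1252")

print(text_1252)
# The 7th byte (index 6) is E9, which Windows-1252 maps to "é".
print(f"{raw[6]:02X} -> {text_1252[6]}")
```

The same bytes only turn back into the original sentence because the decoder uses the same table as the program that saved the text.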
However, in an MS-DOS window, the hexadecimal code “E9” does not display correctly!
A French MS-DOS window would show the following:
cd \temp
type test.txt
"L'expÚrience est le nom que chacun donne Ó ses erreurs." Oscar Wilde
This is simply because MS-DOS assumes by default (on a French computer) that text is encoded using code page 850, shown below:
Code page 850 (DOS Latin-1)
   | x0 | x1 | x2 | x3 | x4 | x5 | x6 | x7 | x8 | x9 | xA | xB | xC | xD | xE | xF |
0x | NUL | SOH | STX | ETX | EOT | ENQ | ACK | BEL | BS | HT | LF | VT | FF | CR | SO | SI |
1x | DLE | DC1 | DC2 | DC3 | DC4 | NAK | SYN | ETB | CAN | EM | SUB | ESC | FS | GS | RS | US |
2x | SP | ! | " | # | $ | % | & | ' | ( | ) | * | + | , | - | . | / |
3x | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | : | ; | < | = | > | ? |
4x | @ | A | B | C | D | E | F | G | H | I | J | K | L | M | N | O |
5x | P | Q | R | S | T | U | V | W | X | Y | Z | [ | \ | ] | ^ | _ |
6x | ` | a | b | c | d | e | f | g | h | i | j | k | l | m | n | o |
7x | p | q | r | s | t | u | v | w | x | y | z | { | | | } | ~ | DEL |
8x | Ç | ü | é | â | ä | à | å | ç | ê | ë | è | ï | î | ì | Ä | Å |
9x | É | æ | Æ | ô | ö | ò | û | ù | ÿ | Ö | Ü | ø | £ | Ø | × | ƒ |
Ax | á | í | ó | ú | ñ | Ñ | ª | º | ¿ | ® | ¬ | ½ | ¼ | ¡ | « | » |
Bx | ░ | ▒ | ▓ | │ | ┤ | Á | Â | À | © | ╣ | ║ | ╗ | ╝ | ¢ | ¥ | ┐ |
Cx | └ | ┴ | ┬ | ├ | ─ | ┼ | ã | Ã | ╚ | ╔ | ╩ | ╦ | ╠ | ═ | ╬ | ¤ |
Dx | ð | Ð | Ê | Ë | È | ı | Í | Î | Ï | ┘ | ┌ | █ | ▄ | ¦ | Ì | ▀ |
Ex | Ó | ß | Ô | Ò | õ | Õ | µ | þ | Þ | Ú | Û | Ù | ý | Ý | ¯ | ´ |
Fx | SHY | ± | ‗ | ¾ | ¶ | § | ÷ | ¸ | ° | ¨ | · | ¹ | ³ | ² | ■ | NBSP |
The hexadecimal code “E9” matches the character “Ú” in code page 850, not the character “é” as we might have expected.
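The MS-DOS display can be simulated as well. This short Python sketch (an illustration of ours, with the hypothetical names QUOTE, raw and as_dos) saves the quotation as Windows-1252 bytes and then reads the same bytes with code page 850.

```python
QUOTE = '"L\'expérience est le nom que chacun donne à ses erreurs." Oscar Wilde'

# Save the text the way a Windows-1252 application would.
raw = QUOTE.encode("cp1252")

# Read the very same bytes the way a French MS-DOS window would.
as_dos = raw.decode("cp850")

print(as_dos)
# Bytes E9 and E0 are looked up in the wrong table:
# "é" becomes "Ú" and "à" becomes "Ó"; the ASCII characters are unchanged.
```

No byte was altered between the two steps. Only the translation table changed.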
A text file is therefore a coded message (not an encrypted one) that must be decoded using the right translation table.
The difficulty is therefore twofold:
– When saving a text, you must carefully choose the encoding according to the target application.
– When displaying a text, you must be able to determine the encoding that was used.
The INVIVOO blog uses UTF-8 encoding. If you forced Internet Explorer to use the ISO-8859-7 encoding through the menu “View => Encoding => More => Greek (ISO)”, the Windows-1252 characters below would be displayed incorrectly.
   | x0 | x1 | x2 | x3 | x4 | x5 | x6 | x7 | x8 | x9 | xA | xB | xC | xD | xE | xF |
Fx | ð | ñ | ò | ó | ô | õ | ö | ÷ | ø | ù | ú | û | ü | ý | þ | ÿ |
Using ISO-8859-7, the characters above would be displayed as follows:
   | x0 | x1 | x2 | x3 | x4 | x5 | x6 | x7 | x8 | x9 | xA | xB | xC | xD | xE | xF |
Fx | Γ° | Γ± | Γ² | Γ³ | Γ΄ | Γ΅ | ΓΆ | Γ· | ΓΈ | ΓΉ | ΓΊ | Γ» | ΓΌ | Γ½ | ΓΎ | ΓΏ |
The expected Western characters are now displayed incorrectly, and each one appears as 2 characters instead of one.
This is because UTF-8 encodes each of these accented Western characters on two bytes, and the ISO-8859-7 (Greek) encoding treats each of those two bytes as a separate character in its mapping table.
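The doubling effect is easy to reproduce. The sketch below (Python, with variable names of our own) encodes the sixteen characters of the “Fx” row in UTF-8 and then decodes the resulting bytes with the Greek ISO-8859-7 table, as the browser would when forced into that encoding.

```python
original = "ðñòóôõö÷øùúûüýþÿ"

# Each of these characters takes two bytes in UTF-8 (C3 B0 ... C3 BF).
utf8_bytes = original.encode("utf-8")

# ISO-8859-7 treats every byte as one character: C3 is "Γ",
# and the second byte is looked up separately.
misread = utf8_bytes.decode("iso8859_7")

print(len(original), "characters ->", len(utf8_bytes), "bytes")
print(misread)
```

Every character of the first row yields a “Γ” followed by a second symbol, which matches the second table above.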
Note that the number of existing encodings is quite large. Each one has a sound reason to exist once you know the history of encoding. With the internationalization that came with the development of the Internet, encoding management became more and more complex because of multilingual environments.
Fortunately, the Unicode standard has met the challenge of gathering every character from every encoding into a single character table: the Unicode character list.
2. UTF-8 FORMAT
Unicode was born from the desire to unify the multitude of existing encodings. That multitude had been necessary because systems assumed that one byte always corresponded to a single character. However, a single byte can only encode 256 characters. Some languages therefore needed their own extended ASCII table. The French “é”, for example, is of no use to Greeks, who need to encode a whole alphabet of their own.
The Unicode solution is to drop the single-byte limitation so that far more characters can be represented (more than a million code points). In UTF-8, for example, most Chinese, Japanese and Korean characters take 3 bytes, and rarer characters such as emoji take 4.
Unfortunately, there are several Unicode encodings (UTF-8, UTF-16, UTF-32), which differ in whether the number of bytes per character is fixed or variable and in the order in which the bytes are read.
We will only present UTF-8 because it tends to dominate, thanks to its compact size and its backward compatibility with ASCII. Indeed, nothing distinguishes an old ASCII file from a UTF-8 file. The two only differ once special characters are used.
Special characters in UTF-8 are stored on 2 to 4 bytes. Each one is encoded simply by following the UTF-8 character map. In a way, it is still as simple as it used to be: a “code” always corresponds to a single character in the character map.
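To make the variable length concrete, here is a small Python sketch (our own illustration; the sample characters are arbitrary choices) that prints how many bytes UTF-8 uses for a few characters and checks the backward compatibility with ASCII.

```python
samples = ["A", "é", "Ж", "€", "中", "😀"]

# Number of UTF-8 bytes needed for each sample character.
byte_counts = {ch: len(ch.encode("utf-8")) for ch in samples}

for ch in samples:
    encoded = ch.encode("utf-8")
    print(f"U+{ord(ch):04X} {ch!r}: {len(encoded)} byte(s) -> {encoded.hex(' ').upper()}")

# A text made only of ASCII characters produces identical bytes in both encodings.
ascii_only = "Oscar Wilde"
same_bytes = ascii_only.encode("ascii") == ascii_only.encode("utf-8")
print("ASCII and UTF-8 bytes identical:", same_bytes)
```

The ASCII letter needs one byte, the accented and Cyrillic letters two, the euro sign and the Chinese character three, and the emoji four.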
If an application cannot read UTF-8, or if it is forced into an extended ASCII encoding (as in our previous example with ISO-8859-7 in Internet Explorer), it will read each byte as one distinct character. However, accented Western European letters are all encoded on 2 bytes in UTF-8.
=> This is why accented characters are displayed as 2 characters instead of one when the encoding is wrongly detected.
You now know almost everything about UTF-8:
– It has backward compatibility with ASCII
– Special characters are stored on 2 to 4 bytes
– Like any encoding, the application that “reads” the hexadecimal code must use the correct encoding
You only know “almost everything” because one specific feature of Unicode still raises compatibility issues: the BOM (Byte Order Mark). We will talk about it in the next part.
3. PERSISTENT ENCODING ISSUES
After this necessary presentation of text encoding, we finally get to the main point of this article with the following question:
If UTF-8 contains all the characters and can replace all the other encodings, why do we still encounter encoding issues?
1. Change takes time
The main reason is that old systems did not necessarily evolve at the same pace as the Unicode revolution. Some databases, applications or packages were programmed to receive a specific encoding, and quite often they expect a single byte per character.
2. The specificity of Microsoft Windows
Microsoft took the liberty of creating its own character tables derived from the ISO-8859-x tables. Moreover, without additional information it is impossible to know for sure whether a text uses an ISO table or a Windows table, because both are just a sequence of bytes.
This liberty taken by Microsoft would be less of a problem if Windows applications used UTF-8 by default, but this is not the case. As long as a text contains no character outside the Windows-1252 table, most Windows applications do not encode it in UTF-8.
Thus, sending a Windows text file to a Linux server or a proprietary application can easily lead to confusion.
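The difference between a Windows table and the ISO table it derives from can be observed on single bytes. This Python sketch (an illustration of ours, comparing Windows-1252 with ISO-8859-1, which Python calls "latin-1") shows one byte that both tables agree on and one that they do not.

```python
shared_byte = b"\xe9"   # same meaning in both tables
windows_only = b"\x80"  # printable in Windows-1252 only

# E9 is "é" in both Windows-1252 and ISO-8859-1.
e9_cp1252 = shared_byte.decode("cp1252")
e9_latin1 = shared_byte.decode("latin-1")

# 80 is the euro sign in Windows-1252 but an invisible control code in ISO-8859-1.
x80_cp1252 = windows_only.decode("cp1252")
x80_latin1 = windows_only.decode("latin-1")

print(e9_cp1252 == e9_latin1)
print(repr(x80_cp1252), repr(x80_latin1))
```

Nothing in the byte 80 itself says which of the two readings is the intended one, which is why the sender and the receiver must agree on the table.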
3. Character fonts
Because Unicode can encode all possible characters, it has become a nightmare for font designers, because drawing every character is a huge task. Most do not attempt it, so that they can focus on the languages they are interested in. In addition, the Unicode standard keeps adding new characters to the table, so existing fonts become incomplete!
As a result, for exotic languages, it may be necessary to work with specific fonts.
However, fonts only affect the display for end users and in no way disrupt the processing or the database storage of your strings.
4. The BOM (Byte Order Mark)
The Byte Order Mark is a short sequence of bytes, invisible when correctly decoded, placed at the beginning of a Unicode text to facilitate its interpretation. The BOM is not mandatory (for UTF-8, the Unicode standard neither requires nor recommends it), but it makes it easier for compatible applications to determine which Unicode encoding is used and in which order the bytes must be read.
This often causes compatibility issues because not all applications know how to handle the BOM. Non-compatible applications treat this sequence of bytes as ordinary extended ASCII characters. When a UTF-8 file is wrongly read as a Windows-1252 file, 3 strange characters appear at the very beginning of the file: ï»¿.
The characters ï»¿ correspond to the hexadecimal sequence EF BB BF, a signature that tells compatible applications that the file is a Unicode file in UTF-8 format.
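The three strange characters can be reproduced in Python, whose "utf-8-sig" codec writes the UTF-8 BOM when encoding and strips it when decoding. This is only a sketch of the phenomenon, with variable names of our own.

```python
# "utf-8-sig" prepends the UTF-8 BOM (EF BB BF) to the encoded text.
with_bom = "Ligne 1".encode("utf-8-sig")
print(with_bom[:3].hex(" ").upper())

# An application that assumes Windows-1252 shows the BOM as three characters.
as_cp1252 = with_bom.decode("cp1252")

# A plain UTF-8 decoder keeps the BOM as one invisible character (U+FEFF).
as_utf8 = with_bom.decode("utf-8")

# A BOM-aware decoder removes it.
as_utf8_sig = with_bom.decode("utf-8-sig")

print(repr(as_cp1252))
print(repr(as_utf8))
print(repr(as_utf8_sig))
```

The second result explains why the BOM is so easy to miss: the text looks identical on screen, but it is one character longer.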
Another problem with the BOM is the confusion it can cause for users. In UTF-8, EF BB BF decodes to a single non-printable character. So, in a Unicode text editor, it is difficult to know whether a BOM is present, since it is invisible and optional in a UTF-8 file. Many users most likely do not know what the BOM is or how it can crash non-compatible applications.
Worse still, there are several possible BOMs that indicate Unicode formats other than UTF-8, and these formats are generally even less compatible.
Since the BOM is invisible to the user, confusion is almost inevitable.
However, in the section below, we provide standard tools so that you can quickly check whether your file is what you expect it to be.
4. DETERMINE THE ENCODING OF A FILE
Regardless of the origin of a file, whether it was generated automatically, sent by a data provider or built manually, it can be useful to verify its format reliably and to reveal a possible BOM.
If you do not have access to advanced (and usually paid) text editors, you can easily do this with standard hexadecimal viewers on Windows and Linux.
On Windows:
– Press Windows+R
– Run the command powershell
– cd \temp
– fhx test.txt
On Linux:
cd /home/test/
file -bi test.txt
=> Linux will “try” to guess the format of the file, but if you want to see the BOM, type the following:
xxd test.txt
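If neither tool is at hand, a similar view can be obtained with a few lines of Python. The function hex_preview below is a hypothetical helper of ours, a rough sketch of what these hexadecimal viewers display rather than a replacement for them.

```python
def hex_preview(data: bytes, width: int = 16) -> str:
    """Return the bytes as lines of 'offset: HH HH HH ...'."""
    lines = []
    for offset in range(0, len(data), width):
        chunk = data[offset:offset + width]
        lines.append(f"{offset:08x}: {chunk.hex(' ').upper()}")
    return "\n".join(lines)


# Example on in-memory bytes: a UTF-8 BOM followed by the word "été".
sample = "été".encode("utf-8-sig")
print(hex_preview(sample))
```

To inspect a real file, the bytes would be read in binary mode, for example with open("test.txt", "rb"), before being passed to the function.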
If there is a BOM at the very beginning of the file, then the text uses a Unicode format:
UTF-8 = EF BB BF
UTF-16 Big Endian = FE FF
UTF-16 Little Endian = FF FE
UTF-32 Big Endian = 00 00 FE FF
UTF-32 Little Endian = FF FE 00 00
Above all, remember that the absence of a BOM does not mean that the file is not a Unicode file.
On the contrary, it may even be necessary to remove the BOM to increase compatibility with your applications.
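The list of signatures above translates directly into a small detection routine. The following Python sketch uses the BOM constants of the standard codecs module; the function name detect_bom and the table BOMS are our own. The UTF-32 signatures are tested first because the UTF-32 Little Endian BOM (FF FE 00 00) begins with the UTF-16 Little Endian BOM (FF FE).

```python
import codecs

# Longest signatures first, so that FF FE 00 00 is not mistaken for FF FE.
BOMS = [
    (codecs.BOM_UTF32_BE, "UTF-32 Big Endian"),
    (codecs.BOM_UTF32_LE, "UTF-32 Little Endian"),
    (codecs.BOM_UTF8, "UTF-8"),
    (codecs.BOM_UTF16_BE, "UTF-16 Big Endian"),
    (codecs.BOM_UTF16_LE, "UTF-16 Little Endian"),
]


def detect_bom(data: bytes):
    """Return the name of the BOM found at the start of data, or None."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name
    return None


print(detect_bom("abc".encode("utf-8-sig")))  # BOM present
print(detect_bom("abc".encode("utf-8")))      # no BOM: None, yet still valid UTF-8
```

A result of None only says that no BOM was found. It says nothing about the actual encoding of the file.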
In the next part, we will see how to remove the BOM in VBA in order to avoid crashing downstream applications.
5. ADVANCED VBA EXERCISE
When you use VBA macros to create a Unicode file intended for format-sensitive applications, you will probably have some difficulty controlling the BOM.
To start with, you can use the commands suggested in the previous section to check your output files.
To create UTF-8 files as you wish, with or without a Byte Order Mark, you need to know the following limitations of VBA:
1. The Print #1 command does not save in UTF-8, so you would lose your Unicode characters
Example:
Open "c:\Temp\test.txt" For Output As #1
Print #1, "Ligne 1 : A very special Unicode character : Ж = D0 96"
Close #1
2. The SaveToFile method of the “ADODB.Stream” object always writes a BOM “EF BB BF” in UTF-8 files! Do not search too much: there is no option to write UTF-8 without a BOM, but we will give you a solution.
Knowing these two limitations will save you a lot of research.
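The first limitation can be illustrated outside VBA. The Python sketch below assumes that the text is converted to the Windows-1252 code page of a Western system when it is written, which is our assumption about the mechanism and not something checked against VBA itself. The variable names are ours.

```python
line = "A very special Unicode character : Ж"

# Windows-1252 has no slot for the Cyrillic letter: it is replaced by "?".
ansi_bytes = line.encode("cp1252", errors="replace")
round_trip = ansi_bytes.decode("cp1252")
print(round_trip)

# UTF-8 keeps the character, stored on the two bytes D0 96.
utf8_bytes = line.encode("utf-8")
print(utf8_bytes[-2:].hex(" ").upper())
```

Once the character has been replaced, the information is gone: no later conversion can bring the original letter back.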
Below is sample code that creates two files: one with a BOM “EF BB BF” and another without a BOM.
Sub Create_UTF8()
    'Declare each variable explicitly As Object (otherwise the first one is a Variant)
    Dim lStreamUTF8BOM As Object, lStreamBinaireSansBOM As Object
    Set lStreamUTF8BOM = CreateObject("ADODB.Stream")
    Set lStreamBinaireSansBOM = CreateObject("ADODB.Stream")
    'We create the master stream
    lStreamUTF8BOM.Type = 2 '2 = text stream
    lStreamUTF8BOM.Mode = 3 '3 = read and write mode
    lStreamUTF8BOM.Charset = "UTF-8" 'Unicode UTF-8 format with BOM
    lStreamUTF8BOM.Open
    lStreamUTF8BOM.WriteText "Ligne 1 : A very special Unicode character : Ж = D0 96" & vbCrLf
    lStreamUTF8BOM.WriteText "Ligne 2" & vbCrLf
    'Saving as UTF-8 with BOM
    lStreamUTF8BOM.SaveToFile "c:\Temp\UTF8withBOM.txt", 2 '2 = overwrite
    'Saving as UTF-8 without BOM
    lStreamBinaireSansBOM.Type = 1 '1 = binary stream
    lStreamBinaireSansBOM.Mode = 3 '3 = read and write mode
    lStreamBinaireSansBOM.Open
    lStreamUTF8BOM.Position = 3 'skip the 3 bytes of the BOM (EF BB BF)
    lStreamUTF8BOM.CopyTo lStreamBinaireSansBOM
    lStreamBinaireSansBOM.SaveToFile "c:\Temp\UTF8withoutBOM.txt", 2 '2 = overwrite
    lStreamBinaireSansBOM.Flush
    lStreamBinaireSansBOM.Close
    lStreamUTF8BOM.Flush
    lStreamUTF8BOM.Close
End Sub
To check the results, you can open the files in C:\TEMP\ using PowerShell and the fhx command, as shown in the previous section.
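For comparison, the expected outcome can be sketched in Python, where the "utf-8-sig" and "utf-8" codecs write the same text with and without a BOM. This is not a translation of the VBA macro, only an illustration of what the two files should contain. It writes into a temporary folder, and the helper write_text is ours.

```python
import os
import tempfile

TEXT = "Ligne 1 : A very special Unicode character : Ж = D0 96\r\nLigne 2\r\n"


def write_text(path: str, encoding: str) -> None:
    # newline="" keeps the explicit CR LF pairs untouched on every platform.
    with open(path, "w", encoding=encoding, newline="") as handle:
        handle.write(TEXT)


with tempfile.TemporaryDirectory() as folder:
    with_bom_path = os.path.join(folder, "UTF8withBOM.txt")
    without_bom_path = os.path.join(folder, "UTF8withoutBOM.txt")
    write_text(with_bom_path, "utf-8-sig")   # writes EF BB BF first
    write_text(without_bom_path, "utf-8")    # no BOM
    with open(with_bom_path, "rb") as handle:
        with_bom = handle.read()
    with open(without_bom_path, "rb") as handle:
        without_bom = handle.read()

print(with_bom[:3].hex(" ").upper())
# Apart from the 3 leading bytes, both files hold exactly the same content.
print(with_bom[3:] == without_bom)
```

The two files differ only by the three leading bytes, which is also what the hexadecimal view of the VBA output files should show.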
CONCLUSION
We have shown you how encoding works and what the UTF-8 revolution has brought.
However, encoding incompatibilities may persist between applications, so we have given you some tools to check the format of your files and to reveal the invisible Byte Order Mark. In addition, you now know how to create UTF-8 files with or without a BOM in VBA.
You now have everything you need to diagnose your encoding issues, using only standard tools!