What’s the difference between Unicode and UTF-8?


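Seeing “Unicode” offered as an encoding next to UTF-8 is an unfortunate misnaming perpetrated by Windows. Unicode is the character set; UTF-8 and UTF-16 are encodings that turn its characters into bytes.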

Because Windows uses UTF-16LE internally as the in-memory storage format for Unicode strings, it considers this to be the natural encoding of Unicode text. In the Windows world, there are ANSI strings (encoded in the system code page of the current machine, and so totally unportable) and there are Unicode strings (stored internally as UTF-16LE).
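To make the distinction concrete, here is a minimal Python sketch that encodes one Unicode string three ways. The sample string and the choice of cp1252 and cp1251 as stand-ins for "the system code page" are my own assumptions for illustration; the real ANSI code page depends on the machine.

```python
# One Unicode string (a sequence of code points), several byte encodings.
text = "h\u00e9"  # "hé": code points U+0068 and U+00E9

utf8 = text.encode("utf-8")
utf16le = text.encode("utf-16-le")  # the encoding Windows labels "Unicode"
ansi = text.encode("cp1252")        # assumed "ANSI" code page (Western European)

print(utf8.hex(" "))     # 68 c3 a9
print(utf16le.hex(" "))  # 68 00 e9 00
print(ansi.hex(" "))     # 68 e9

# The same "ANSI" bytes read on a machine whose code page is cp1251 (Cyrillic)
# come out as a different string, which is the unportability problem.
misread = ansi.decode("cp1251")
print(misread == text)   # False
```

The string is the same in every case; only the bytes differ. That is the sense in which "Unicode" is not an encoding, whereas UTF-8 and UTF-16LE are.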

This was all devised in the early days of Unicode, before we realised that the 16 bits of UCS-2 weren’t enough, and before UTF-8 was invented. This is why Windows’s support for UTF-8 is all-round poor.

This misguided naming scheme became part of the user interface. A text editor that uses Windows’s encoding support to provide a range of encodings will automatically and inappropriately describe UTF-16LE as “Unicode”, and UTF-16BE, if provided, as “Unicode big-endian”.

(Other editors that do encodings themselves, like Notepad++, don’t have this problem.)
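If you have a file that a Windows tool saved as “Unicode” and want to know what it really is, the byte order mark at the start usually gives it away. The sketch below is a rough illustration, not a robust detector: `describe_bom` is a hypothetical helper name, it only looks at the BOM, and it assumes the tool wrote one.

```python
import codecs

def describe_bom(data: bytes) -> str:
    """Name the encoding implied by a leading byte order mark, if any."""
    # Check UTF-32 first: its little-endian BOM begins with the UTF-16LE BOM.
    if data.startswith((codecs.BOM_UTF32_LE, codecs.BOM_UTF32_BE)):
        return "UTF-32"
    if data.startswith(codecs.BOM_UTF8):
        return "UTF-8"
    if data.startswith(codecs.BOM_UTF16_LE):
        return "UTF-16LE"  # what Windows calls "Unicode"
    if data.startswith(codecs.BOM_UTF16_BE):
        return "UTF-16BE"  # what Windows calls "Unicode big-endian"
    return "unknown (no BOM)"

# Simulated file contents: a BOM followed by the encoded text.
as_unicode = codecs.BOM_UTF16_LE + "hi".encode("utf-16-le")
as_unicode_be = codecs.BOM_UTF16_BE + "hi".encode("utf-16-be")

print(describe_bom(as_unicode))     # UTF-16LE
print(describe_bom(as_unicode_be))  # UTF-16BE
print(describe_bom(b"hi"))          # unknown (no BOM)
```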

If it makes you feel any better about it, ‘ANSI’ strings aren’t based on any ANSI standard, either.

