Dave Anderson

Oh hey TIL why the UTF-8 byte order mark is the seemingly random byte sequence <EF BB BF>. It's simply because that's the UTF-8 encoding of U+FEFF, the code point that's also used as the byte order mark in UTF-16 and UTF-32.

UTF-8 has no byte-ordering issues, so the UTF-8 BOM is an oddity that shouldn't be emitted. But it exists and is specified because it's what you get if you take a BOM-ful UTF-16/32 sequence and naively transcode it to UTF-8: the leading U+FEFF BOM becomes <EF BB BF>.
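
For anyone who wants to check the arithmetic, here is a minimal Python sketch. The helper name `encode_3byte` is made up for illustration; it only handles the three-byte UTF-8 form (1110xxxx 10xxxxxx 10xxxxxx), which is where U+FEFF lands. The second half shows the naive transcoding case, assuming a little-endian UTF-16 input.

```python
def encode_3byte(cp: int) -> bytes:
    """Hand-rolled UTF-8 encoder for U+0800..U+FFFF only (hypothetical helper)."""
    if not 0x0800 <= cp <= 0xFFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("not a three-byte UTF-8 code point")
    return bytes([
        0xE0 | (cp >> 12),          # 1110xxxx: top 4 bits
        0x80 | ((cp >> 6) & 0x3F),  # 10xxxxxx: middle 6 bits
        0x80 | (cp & 0x3F),         # 10xxxxxx: low 6 bits
    ])

bom_utf8 = encode_3byte(0xFEFF)
print(bom_utf8.hex(" "))  # ef bb bf

# Naive transcoding: a BOM-ful UTF-16LE byte stream, decoded with a codec
# that does not strip the BOM, then re-encoded as UTF-8.
utf16 = "\ufeffhi".encode("utf-16-le")  # FF FE 68 00 69 00
naive = utf16.decode("utf-16-le").encode("utf-8")
print(naive)  # b'\xef\xbb\xbfhi'
```

Python's `utf-16-le` codec leaves a leading U+FEFF in place, which is what makes the round trip above "naive"; the plain `utf-16` codec would consume it.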

Peter Bindels

@danderson please check your code; the bytes 0xFF and 0xFE can never appear in valid UTF-8.
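
A brute-force way to confirm this, as a rough Python sketch: encode every Unicode scalar value and record which byte values ever show up. The variable names are arbitrary, and the loop takes about a second since it covers roughly 1.1 million code points.

```python
seen = set()
for cp in range(0x110000):
    if 0xD800 <= cp <= 0xDFFF:
        continue  # surrogates are not valid scalar values, so UTF-8 can't encode them
    seen.update(chr(cp).encode("utf-8"))

# Byte values that never occur anywhere in a valid UTF-8 stream.
never_used = sorted(set(range(256)) - seen)
print([hex(b) for b in never_used])

# The UTF-16LE BOM bytes are rejected outright by a UTF-8 decoder.
try:
    b"\xff\xfe".decode("utf-8")
except UnicodeDecodeError as exc:
    print("rejected:", exc.reason)
```

The unused set is 0xC0, 0xC1, and 0xF5 through 0xFF, so both 0xFE and 0xFF are in it.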

SnowFox

@danderson The best part is that U+FEFF was originally intended *solely* as a byte order mark: https://www.unicode.org/versions/Unicode1.0.0/ch03_6.pdf

It was changed to ZWNBSP in Unicode 1.1 <https://www.unicode.org/versions/Unicode1.1.0/ch03.pdf>, and as you might expect, "character that may or may not be removed when it's at the start of a string" has caused no end of issues.
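
One concrete flavor of that ambiguity, sketched with Python's standard codecs (this is only how Python happens to behave, not a statement about what the Unicode standard requires): the `utf-8` codec keeps a leading U+FEFF as a character, while `utf-8-sig` treats it as a signature and drops it, but only at the very start.

```python
data = b"\xef\xbb\xbfhello"

kept = data.decode("utf-8")          # leading U+FEFF survives as a character
stripped = data.decode("utf-8-sig")  # leading U+FEFF is treated as a BOM and removed

print(repr(kept))        # '\ufeffhello'
print(repr(stripped))    # 'hello'
print(kept == stripped)  # False

# A U+FEFF in the middle of the text is never removed by either codec.
middle = "a\ufeffb".encode("utf-8").decode("utf-8-sig")
print(repr(middle))      # 'a\ufeffb'
```

Two programs that read the same file with different codecs end up with strings that differ in length and fail equality checks, even though they look identical on screen.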

Dave Anderson

@dascandy Woops, thank you! I'd been staring at hex pairs for so long that I didn't notice the typo despite re-reading it a dozen times.

Dave Anderson

@snowfox Even better, that ZWNBSP use is now deprecated! As of Unicode 3.2, using U+FEFF as anything other than a BOM is discouraged, and U+2060 WORD JOINER is strongly recommended for ZWNBSP's function instead.

I'm really curious what led to the change in 1.1 though! The change in 3.2 seems to be clearly an acknowledgement of the headaches it caused, but I can't find the reasoning for overloading the BOM character in the first place.