The UTF-8 encoded BOM is an offense to engineering.
-
Dave Anderson (@[email protected])
Oh hey TIL why the UTF-8 byte order mark is the seemingly random byte sequence EF BB BF. It's simply because that's the UTF-8 encoding of U+FEFF, the code point that's also used as the byte order mark in UTF-16 and UTF-32. UTF-8 has no issues with byte ordering, so the UTF-8 BOM is an oddity that shouldn't be emitted. But it exists and is specified because it's what happens if you take a BOM-ful UTF-16/32 sequence and naively transpose it to UTF-8: the leading U+FEFF BOM becomes EF BB BF.
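A minimal Go sketch of the point being made here, using only the standard library (nothing here is from the thread itself): encoding the single code point U+FEFF as UTF-8 produces exactly the bytes EF BB BF.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

func main() {
	// U+FEFF is the code point used as the byte order mark in UTF-16/32.
	const bom rune = '\uFEFF'

	// Encode it as UTF-8; the result is the familiar three-byte sequence.
	buf := make([]byte, utf8.RuneLen(bom))
	utf8.EncodeRune(buf, bom)

	fmt.Printf("% X\n", buf) // prints: EF BB BF
}
```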
-
@robpike Also my first feedback on launching a product as a professional software person. I implemented the Google Code hosting wiki as an intern, launched it... And Windows users started asking why their wiki pages had leading garbage on them after editing with WordPad. And so I learned about the UTF-8 BOM, and Microsoft's exciting choices re: its use in software.
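As a minimal Go sketch of the workaround this kind of bug forces (the stripBOM helper and the sample input are hypothetical illustrations, not the actual Google Code wiki code): drop a leading EF BB BF before storing or rendering user-submitted text.

```go
package main

import (
	"bytes"
	"fmt"
)

// utf8BOM is the UTF-8 encoding of U+FEFF.
var utf8BOM = []byte{0xEF, 0xBB, 0xBF}

// stripBOM removes a leading UTF-8 BOM, if present, so text saved by
// BOM-emitting editors doesn't show up with leading garbage downstream.
func stripBOM(b []byte) []byte {
	return bytes.TrimPrefix(b, utf8BOM)
}

func main() {
	// Hypothetical wiki page text as saved by a BOM-emitting editor.
	input := append(append([]byte{}, utf8BOM...), []byte("= Wiki page title")...)

	fmt.Printf("before: % X ...\n", input[:3]) // EF BB BF ...
	fmt.Printf("after:  %q\n", stripBOM(input))
}
```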
-
@danderson In 1993 Nathan Myhrvold visited Bell Labs to give a talk. UTF-8 was too young to have caught on yet, and I wanted to demonstrate it to Nathan to get Microsoft on board. But he wasn't interested, claiming they had their own solution for handling wide characters. He didn't want to, and didn't, hear about UTF-8. It was probably too late to change their plans anyway, but the Windows system interface still disturbs me 30-plus years on.
-
@robpike Reading the Unicode spec on encoding forms, I found some of the wording a bit defensive around pointing out that, well, sometimes, in some situations, it's quite possible that UTF-16 may well be preferable to UTF-8. I found it a little odd in the moment, but now I wonder...
-
@robpike I expect I'm projecting, but in a couple of places it seems to me the spec makes too fine a point of restating that it has strictly no opinion between a number of possible choices, and that certainly nobody should be judged for making one set of choices over another.
-
@danderson In the early days of ISO 10646, and also Unicode I believe, there was an assertion that 16 bits was the right answer (note that we ended up at 21 bits, never mind), completely ignoring questions of all that existing 8-bit data, and the horrors of actually dealing with byte order and all the zero bytes that would appear. UTF was proposed as a way to deal with those pesky folks who refused to submit, and then UTF-8 was finally adopted by those for whom reality existed.
-
@danderson I've been pondering putting a talk together to explain the intersection of the varied forces of nationalism, US-centrism, multiple countries promoting their own agendas, the desire to do better, designs proposed by linguists, and the ultimate need for software engineering. It's a good story wherein good things happened, eventually. People who bitch about Unicode without understanding what preceded it all are just being rude or at best wilfully ignorant.
-
That'd be a really interesting talk!
@[email protected] @[email protected]
-
@robpike I for one would love to hear that story. I see echoes of it in the spec and technical reports over the years, but it's hard to get a picture of how we got there.
-
@danderson @robpike I, too, would love to hear this.
I'd also be interested in hot takes about how the naming of both Unicode and UTF-* led to decades of confusion amongst devs about what they actually are...