Wow, English-only people (or Western languages, for that matter) are so naïve. In case you didn't know, the lang attribute is very important in East Asian languages.

洪民憙 (Hong Minhee)

@jernej__s I'm in favor of Han unification though. See also this:

洪民憙 (Hong Minhee) (@[email protected])

Well, I vote for Han unification of #Unicode, and I rather think that more Chinese characters should have been unified (e.g., 高 & 髙, 產 & 産, 內 & 内). 🤷 #漢字 #hanzi #hanja #kanji

Fosstodon (fosstodon.org)

Jernej Simončič �

@hongminhee But why? If the character looks different, why wouldn't it be represented by a different codepoint?

洪民憙 (Hong Minhee)

@jernej__s Because they are the same characters, even though they look slightly different. “Unicode encodes characters, not glyphs.” —Unicode FAQ. It's like Arabic numeral 7 is encoded as a single codepoint whether it has an extra horizontal line drawn across it or not.

Mikołaj Hołysz

@hongminhee Serious question. How do platforms that accept user-generated content handle this?

Take Mastodon for example, if three users send a post, one in Chinese, one in Korean, one in Japanese, and the app is international, how would this be handled? How should this be handled?

Are apps targeting the Asian market rewquiring the user to correctly fill in the "language" field each time? Are you effectively required to include AI-based language detection in each product? Are browsers truly unable to figure this out on their own when there's no lang attribute present?

洪民憙 (Hong Minhee)

@miki On the web, it's common to specify the lang attribute in the top-level <html> tag. Internationalized apps will prefer the user's locale setting.

? Offline

@hongminhee
So a Unicode codepoint can correspond to different glyphs in the same font depending on the language? This seems like a big oversight by Unicode, unless it's a conscious decision?

洪民憙 (Hong Minhee)

@thomas It's called Han unification. See also the following thread:

洪民憙 (Hong Minhee) (@[email protected])

Well, I vote for Han unification of #Unicode, and I rather think that more Chinese characters should have been unified (e.g., 高 & 髙, 產 & 産, 內 & 内). 🤷 #漢字 #hanzi #hanja #kanji

Fosstodon (fosstodon.org)

http :verified:

@hongminhee I'm not very familiar with asian languages. But if it's the same word, shouldn't they be written the same? Or if there are clear differences, shouldn't there be different Unicode codepoints?
With different Unicode codepoints, would the attribute still be required?

洪民憙 (Hong Minhee)

@http They are just stylistic differences, so it is okay to recognize. However, people tend to find them a kind of awkward.

http :verified:

@hongminhee Maybe then they should have an additional attribute, like stylistic style type? Similar to different colors of thumbs up hands. If there's a clear difference, they should have different codes. Bad Unicoding?

洪民憙 (Hong Minhee)

@http There's a something called variation selector, but it's not widely used in East Asia because it's introduced too lately.

UTS #37: Unicode Ideographic Variation Database

(www.unicode.org)

NiceMicro

@hongminhee @pavsmith one of the most amusing things getting used to when moving to Korea was the way to write the numbers 1 and 7... before that I wouldn't have guessed that numbers are written differently (and hence confusingly for the outsider) in different cultures.

So it is definitely always a good idea to err on the side of being precise.

Wow, English-only people (or Western languages, for that matter) are so naïve. In case you didn't know, the lang attribute is very important in East Asian languages.

洪 民憙 (Hong Minhee) (@[email protected])

洪 民憙 (Hong Minhee) (@[email protected])

UTS #37: Unicode Ideographic Variation Database

洪民憙 (Hong Minhee) (@[email protected])

洪民憙 (Hong Minhee) (@[email protected])