Wow, English-only people (or Western languages, for that matter) are so naïve. In case you didn't know, the lang attribute is very important in East Asian languages.
-
洪 民憙 (Hong Minhee)replied to Jernej Simončič � last edited by
@jernej__s I'm in favor of Han unification though. See also this:
洪 民憙 (Hong Minhee) (@[email protected])
Well, I vote for Han unification of #Unicode, and I rather think that more Chinese characters should have been unified (e.g., 高 & 髙, 產 & 産, 內 & 内). 🤷 #漢字 #hanzi #hanja #kanji
Fosstodon (fosstodon.org)
-
Jernej Simončič �replied to 洪 民憙 (Hong Minhee) last edited by
@hongminhee But why? If the character looks different, why wouldn't it be represented by a different codepoint?
-
洪 民憙 (Hong Minhee)replied to Jernej Simončič � last edited by
@jernej__s Because they are the same characters, even though they look slightly different. “Unicode encodes characters, not glyphs.” —Unicode FAQ. It's like Arabic numeral 7 is encoded as a single codepoint whether it has an extra horizontal line drawn across it or not.
-
Mikołaj Hołyszreplied to 洪 民憙 (Hong Minhee) last edited by
@hongminhee Serious question. How do platforms that accept user-generated content handle this?
Take Mastodon for example, if three users send a post, one in Chinese, one in Korean, one in Japanese, and the app is international, how would this be handled? How should this be handled?
Are apps targeting the Asian market rewquiring the user to correctly fill in the "language" field each time? Are you effectively required to include AI-based language detection in each product? Are browsers truly unable to figure this out on their own when there's no lang attribute present?
-
洪 民憙 (Hong Minhee)replied to Mikołaj Hołysz last edited by
@miki On the web, it's common to specify the lang attribute in the top-level <html> tag. Internationalized apps will prefer the user's locale setting.
-
@hongminhee
So a Unicode codepoint can correspond to different glyphs in the same font depending on the language? This seems like a big oversight by Unicode, unless it's a conscious decision? -
@thomas It's called Han unification. See also the following thread:
洪 民憙 (Hong Minhee) (@[email protected])
Well, I vote for Han unification of #Unicode, and I rather think that more Chinese characters should have been unified (e.g., 高 & 髙, 產 & 産, 內 & 内). 🤷 #漢字 #hanzi #hanja #kanji
Fosstodon (fosstodon.org)
-
http :verified:replied to 洪 民憙 (Hong Minhee) last edited by
@hongminhee I'm not very familiar with asian languages. But if it's the same word, shouldn't they be written the same? Or if there are clear differences, shouldn't there be different Unicode codepoints?
With different Unicode codepoints, would the attribute still be required? -
洪 民憙 (Hong Minhee)replied to http :verified: last edited by
@http They are just stylistic differences, so it is okay to recognize. However, people tend to find them a kind of awkward.
-
http :verified:replied to 洪 民憙 (Hong Minhee) last edited by
@hongminhee Maybe then they should have an additional attribute, like stylistic style type? Similar to different colors of thumbs up hands. If there's a clear difference, they should have different codes. Bad Unicoding?
-
洪 民憙 (Hong Minhee)replied to http :verified: last edited by
@http There's a something called variation selector, but it's not widely used in East Asia because it's introduced too lately.
-
NiceMicroreplied to 洪 民憙 (Hong Minhee) last edited by
@hongminhee @pavsmith one of the most amusing things getting used to when moving to Korea was the way to write the numbers 1 and 7... before that I wouldn't have guessed that numbers are written differently (and hence confusingly for the outsider) in different cultures.
So it is definitely always a good idea to err on the side of being precise.