Wow, people who only use English (or Western languages, for that matter) are so naïve. In case you didn't know, the lang attribute is very important for East Asian languages.
-
concept of a display name replied to 洪 民憙 (Hong Minhee)
@hongminhee it’s not trivial to determine per se, but a cross-entropy classifier on character bigrams (that is, 1990s NLP) is surprisingly accurate at determining the language of a string.
However—and this is the big caveat—it’s only trivial if (1) you know where the language boundaries are and (2) the string is long enough to get robust bigram statistics.
Even if you weren’t to specify the language, “lang” solves problem (1) readily.
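For readers who have not seen the technique, here is a minimal sketch of a character-bigram cross-entropy classifier. It is not anyone's production code: the function names, the add-one smoothing, and the tiny `TOY_SAMPLES` training texts are all assumptions made for illustration, and a real model would be trained on far more text per language.

```python
import math
from collections import Counter

# Hypothetical toy training data; real models need much larger corpora.
TOY_SAMPLES = {
    "en": "the language attribute is important for the web and the reader",
    "ja": "これは日本語の文章です。言語の属性はとても大切です。",
    "ko": "이것은 한국어 문장입니다. 언어 속성은 매우 중요합니다.",
}


def bigrams(text):
    """Return the overlapping character bigrams of text."""
    return [text[i:i + 2] for i in range(len(text) - 1)]


def train(samples):
    """Build one bigram frequency table per language label."""
    return {lang: Counter(bigrams(text)) for lang, text in samples.items()}


def cross_entropy(text, counts, vocab_size):
    """Average bits per bigram of text under one language's model.

    Add-one smoothing gives unseen bigrams a small nonzero probability.
    """
    grams = bigrams(text)
    total = sum(counts.values())
    bits = 0.0
    for gram in grams:
        probability = (counts[gram] + 1) / (total + vocab_size)
        bits -= math.log2(probability)
    return bits / len(grams)


def classify(text, models):
    """Return the label with the lowest cross-entropy, or None if too short."""
    if len(text) < 2:
        return None  # no bigrams at all, so there is nothing to score
    vocab = set()
    for counts in models.values():
        vocab.update(counts)
    vocab_size = len(vocab) + 1  # +1 reserves probability mass for unseen bigrams
    return min(models, key=lambda lang: cross_entropy(text, models[lang], vocab_size))


models = train(TOY_SAMPLES)
print(classify("the web", models))
print(classify("日本語です", models))
print(classify("直", models))
```

Both caveats show up directly in the sketch: the function assumes the whole input is one language, and a single character yields no bigrams to score. A string that shares no bigrams with any training sample still gets a label, but that label is an artifact of smoothing rather than evidence.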
-
洪 民憙 (Hong Minhee) replied to concept of a display name
@thedansimonson Yeah, but strings in East Asian languages are often too short for that, e.g., 孤立無援, which is a valid sentence in Korean, Chinese, and Japanese.
-
concept of a display name replied to 洪 民憙 (Hong Minhee)
@hongminhee oh yea totally—but is it common to mix strings of that length on a website containing multiple languages? I’d suspect generally, where things get mixed up, the shortest you might have is a link or button.
-
洪 民憙 (Hong Minhee) replied to concept of a display name
@thedansimonson You're right, the shortest ones are buttons or links in a navigation bar.
-
James Wood replied to 洪 民憙 (Hong Minhee)
@hongminhee How accurately are `lang` attributes placed in practice? I remember seeing “直” displayed wrong for the intended language on social media before (and, by the way, I don't think it's possible for me to specify the intended language in my quote there), and I often see people on the Fediverse who set-and-forget their language and then post in a different language, which you can tell on the client I use because it offers to translate the message.
-
洪 民憙 (Hong Minhee) replied to James Wood
@mudri Yes, in practice, people often don't specify the lang attribute at all, and as you said, even on the Fediverse, many people post without setting the language correctly.
-
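@hongminhee But the Korean web doesn't really seem to care about this either... and Misskey, which is Japanese-made, doesn't provide a feature for adding it.
Korean Mastodon users don't seem to pay much attention to their language settings either.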
-
@ssharp That's right…
-
Jernej Simončič replied to 洪 民憙 (Hong Minhee)
@hongminhee I still think it's stupid that Unicode hasn't separated the different-looking CJK glyphs into separate codepoints. If we can have A, Α, А and A as separate characters, why couldn't that have been done for CJK?
-
洪 民憙 (Hong Minhee) replied to Jernej Simončič
@jernej__s I'm in favor of Han unification though. See also this:
洪 民憙 (Hong Minhee)
Well, I vote for Han unification of #Unicode, and I rather think that more Chinese characters should have been unified (e.g., 高 & 髙, 產 & 産, 內 & 内). 🤷 #漢字 #hanzi #hanja #kanji
Fosstodon (fosstodon.org)
-
Jernej Simončič replied to 洪 民憙 (Hong Minhee)
@hongminhee But why? If the character looks different, why wouldn't it be represented by a different codepoint?
-
洪 民憙 (Hong Minhee) replied to Jernej Simončič
@jernej__s Because they are the same characters, even though they look slightly different. “Unicode encodes characters, not glyphs.” (Unicode FAQ) It's like how the Arabic numeral 7 is encoded as a single codepoint whether or not it has an extra horizontal line drawn across it.
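To make the character-versus-glyph distinction concrete, here is a small sketch using Python's standard `unicodedata` module. The helper name `describe` and the `PAIRS` list are illustrative assumptions; the characters themselves are the ones quoted in this thread.

```python
import unicodedata


def describe(text):
    """Return (code point, Unicode name) pairs for each character in text."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>")) for ch in text]


# A unified ideograph is one code point, whichever language the text is in.
# Which glyph gets drawn is decided later by the font and the language tag.
print(describe("直"))

# The four-character phrase mentioned earlier is the same code point
# sequence in Korean, Chinese, and Japanese text.
print(describe("孤立無援"))

# Lookalike letters from different scripts are separate characters:
# Latin A, Greek Alpha, Cyrillic A.
print(describe("\u0041\u0391\u0410"))

# Pairs from the linked post that were not unified: each member of a pair
# has its own code point.
PAIRS = [("高", "髙"), ("產", "産"), ("內", "内")]
for first, second in PAIRS:
    print(describe(first), describe(second))
```

Nothing in the code point sequence records the intended language, which is why that information has to travel separately, for example in a `lang` attribute.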
-
Mikołaj Hołysz replied to 洪 民憙 (Hong Minhee)
@hongminhee Serious question. How do platforms that accept user-generated content handle this?
Take Mastodon for example, if three users send a post, one in Chinese, one in Korean, one in Japanese, and the app is international, how would this be handled? How should this be handled?
Are apps targeting the Asian market requiring the user to correctly fill in the "language" field each time? Are you effectively required to include AI-based language detection in each product? Are browsers truly unable to figure this out on their own when there's no lang attribute present?
-
洪 民憙 (Hong Minhee) replied to Mikołaj Hołysz
@miki On the web, it's common to specify the lang attribute in the top-level <html> tag. Internationalized apps will prefer the user's locale setting.
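As a rough sketch of what that can look like for user-generated content, the snippet below builds a page with a default language on `<html>` and a per-post `lang` override where the post declares one. The function name `render_page` and the post format are hypothetical and not taken from Mastodon or any other platform; only the `lang` attribute itself is standard HTML.

```python
from html import escape


def render_page(page_lang, posts):
    """Render posts as HTML paragraphs.

    posts is a list of (text, lang) tuples; lang may be None, in which case
    the paragraph inherits the page-level language from <html lang="...">.
    """
    lines = ["<!DOCTYPE html>", f'<html lang="{escape(page_lang, quote=True)}">', "<body>"]
    for text, lang in posts:
        if lang is None:
            lines.append(f"<p>{escape(text)}</p>")
        else:
            lines.append(f'<p lang="{escape(lang, quote=True)}">{escape(text)}</p>')
    lines.append("</body>")
    lines.append("</html>")
    return "\n".join(lines)


# The same character tagged for three audiences, plus an untagged post
# that falls back to the page language.
page = render_page("en", [("直", "ja"), ("直", "zh-Hant"), ("直", "ko"), ("hello", None)])
print(page)
```

With tags like these, a browser can choose a suitable font for each paragraph. The untagged post only gets the page default, which is the set-and-forget problem raised earlier in the thread.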
-
@hongminhee
So a Unicode codepoint can correspond to different glyphs in the same font depending on the language? This seems like a big oversight by Unicode, unless it's a conscious decision?
-
@thomas It's called Han unification. See also the following thread:
洪 民憙 (Hong Minhee)
Well, I vote for Han unification of #Unicode, and I rather think that more Chinese characters should have been unified (e.g., 高 & 髙, 產 & 産, 內 & 内). 🤷 #漢字 #hanzi #hanja #kanji
Fosstodon (fosstodon.org)
-
http :verified: replied to 洪 民憙 (Hong Minhee)
@hongminhee I'm not very familiar with Asian languages. But if it's the same word, shouldn't they be written the same? Or if there are clear differences, shouldn't there be different Unicode codepoints?
With different Unicode codepoints, would the attribute still be required?
-
洪 民憙 (Hong Minhee) replied to http :verified:
@http They are just stylistic differences, so the characters are still recognizable. However, people tend to find the unfamiliar forms somewhat awkward.
-
http :verified: replied to 洪 民憙 (Hong Minhee)
@hongminhee Maybe then they should have an additional attribute, like stylistic style type? Similar to different colors of thumbs up hands. If there's a clear difference, they should have different codes. Bad Unicoding?
-
洪 民憙 (Hong Minhee) replied to http :verified:
@http There's something called a variation selector, but it's not widely used in East Asia because it was introduced too late.
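For anyone curious what a variation selector looks like at the code point level, here is a minimal sketch. It assumes the two main selector ranges (U+FE00 to U+FE0F and U+E0100 to U+E01EF) and uses 葛 (U+845B) followed by U+E0100 purely as an illustration; the helper names are made up, and the code does not check whether a sequence is registered or whether any font supports it.

```python
import unicodedata

# Assumed ranges: the standard selectors and the supplement used for ideographs.
STANDARD_SELECTORS = range(0xFE00, 0xFE10)
SUPPLEMENT_SELECTORS = range(0xE0100, 0xE01F0)


def is_variation_selector(ch):
    """Return True if ch is in one of the two selector ranges above."""
    code_point = ord(ch)
    return code_point in STANDARD_SELECTORS or code_point in SUPPLEMENT_SELECTORS


def strip_variation_selectors(text):
    """Drop selectors, leaving only the base characters."""
    return "".join(ch for ch in text if not is_variation_selector(ch))


base = "\u845B"                  # 葛
sequence = base + "\U000E0100"   # base character followed by a selector

print(len(base), len(sequence))
print(unicodedata.name("\U000E0100"))
print(strip_variation_selectors(sequence) == base)
```

The selector is a separate, invisible code point that follows the base character, so software that ignores or strips it still sees the same character. That fits the "characters, not glyphs" principle quoted above.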