
Not sure if the l33tspeak analogy is fully justified.

In case of the "missing" letter (called khanda-ta in Bengali) for the Bengali equivalent of "suddenly", historically, it has been a derivative of the ta-halant form (ত + ্ + ‍ ). As the language evolved, khanda-ta became a grapheme of its own, and Unicode 4.1 did encode it as a distinct grapheme. A nicely written review of the discussions around the addition can be found here: http://www.unicode.org/L2/L2004/04252-khanda-ta-review.pdf

I could write the author's name fine: আদিত্য. A search with the string in the Bengali version of Wikipedia pulls up quite a few results as well, so other people are writing it too. The final "letter" in that string is a compound character, and there's no clear evidence that it needs to be treated as an independent one. Even while in primary school, we were taught the final "letter" in the author's name as a conjunct. In contrast, for the khanda-ta case, it could be shown that modern Bengali dictionaries explicitly referred to khanda-ta as an independent character.

For me, many of these problems are more of an input issue than an encoding issue. Non-Latin languages have had to shoehorn their scripts onto keyboard layouts designed for Latin scripts, and that has always been suboptimal. With touch devices we have newer ways to think about this problem, and people are starting to try things out.

[Disclosure: I was involved in the Unicode discussions about khanda-ta (I was not affiliated with a consortium member) and I have been involved with Indic localization projects for the past 15 years]




> I could write the author's name fine: আদিত্য

Author here.

Well, yes and no. The jophola at the end is not actually given its own codepoint[0]. The best analogy I can give is to a ligature in English[1]. The Bengali fonts that you have installed happen to render it as a jophola, the way some fonts happen to render "ff" as "ﬀ", but that's not the same thing as saying that it actually is a jophola (according to the Unicode standard).

The difference between the jophola and an English ligature, though, is that English ligatures are purely aesthetic. Typing two "f" characters in a row has the same obvious semantic meaning as the ff ligature, whereas the characters that are required to type a jophola have no obvious semantic, phonetic, or orthographic connection to the jophola.

[0] http://unicode.org/charts/PDF/U0980.pdf

[1] Some fonts will render (e.g.) two "f"s in a row as if they were a ligature, even though it's not a true ﬀ (U+FB00).
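
For the curious, here's a quick illustrative sketch (Python 3; unicodedata is just the handiest way to list the codepoints) of what the encoded name actually contains - note that there is no jophola codepoint, only the hasant/virama followed by য, and that the vowel sign ি is stored after দ even though it is drawn to its left:

    import unicodedata

    for ch in "আদিত্য":
        print(f"U+{ord(ch):04X}  {unicodedata.name(ch)}")

    # U+0986  BENGALI LETTER AA
    # U+09A6  BENGALI LETTER DA
    # U+09BF  BENGALI VOWEL SIGN I
    # U+09A4  BENGALI LETTER TA
    # U+09CD  BENGALI SIGN VIRAMA
    # U+09AF  BENGALI LETTER YA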


> The Bengali fonts that you have installed happen to render it as a jophola

It's not only the Bengali font - the text rendering framework of my operating system also needs to have a bunch of complex rules to figure out that a jophola needs to be rendered. It also needs to know that the visual ordering of i-kar is before the preceding consonant cluster (দ in আদিত্য).

> the characters that are required to type a jophola have no semantic, phonetic, or orthographic connection to the jophola.

Not so sure about that. The fact that it's called a "jo"-phola points to a relationship. The relationship may have become less apparent as the script has evolved (though there are words such as সহ্য which make the relationship more visible), but the distinction is still not as pronounced as between "ta" and "khanda-ta". For the khanda-ta case, it was explicit from the then-current editions of the dictionaries produced by the language bodies of both Bangladesh and West Bengal that the character had become distinct (স্বতন্ত্র বর্ণ was the phrase that was used). As far as I know, there hasn't been any such claim about jophola from the language bodies. Also, if you look at the collation in Bengali dictionaries, jo-phola is treated as (্+য) for collation.


Is Bengali your first language?

While one can make the case that ত্য is simply "'to' - 'o' + 'ya' = 'to'"[0][1], it's rather confusing mental acrobatics, and it doesn't reflect either how the writing system is taught, or how native speakers use it and think of it on a day-to-day basis.

If anything, your comment makes a stronger argument for consolidating ই and ি (they are literally the same letter and phoneme, but written differently in different contexts) than for combining the viram and য into the jophola.

[0] To non-Bengali speakers reading this, yes, this is how that construction would work, and yes, I am aware that the arithmetic doesn't appear to add up (which I guess is part of the point).

[1] Also, now that I think about it, the য is a consonant, not a vowel, so using it in place of a vowel is doubly awkward. This is particularly an issue in Bengali, where sounds that might be consonants in English (like "r" and "l") can be either consonants or vowels in Bengali, depending on the word.


> Is Bengali your first language?

Yes.

> [...] it's rather confusing mental acrobatics, and it doesn't reflect either how the writing system is taught, or how native speakers use it and think of it on a day-to-day basis.

Mental acrobatics are part-and-parcel of the language, either in digital or non-digital form. If I were to spell out your name aloud, I would end with "ত-এ য-ফলা", which doesn't really say anything about how ত্য is pronounced. While writing on paper, we say "ক-এ ইকার", and then we reorder what we just said to write the ইকার in front of the ক. Even more complicated mental acrobatics - we say ক-এ ওকার, and then proceed to write half of the ওকার in front of the ক and then the other half, after the ক. We don't necessarily think about these when we carry out these acrobatics in our head, but they exist, and we have made the layer on top of the encoding system (rendering, and to some extent, input) deal with these acrobatics as well. My point in the original comment (and to some extent in the preceding one) was to emphasize that a lot of these issues are at the input method level - we should not have to think about encoding as long as it accurately and unambiguously represents whatever we want it to represent.

Just out of curiosity - I would be interested to know more about your learning experience that you feel is not well aligned with the representation of jophola as it is currently.


> My point in the original comment (and to some extent in the preceding one) was to emphasize that a lot of these issues are at the input method level - we should not have to think about encoding as long as it accurately and unambiguously represents whatever we want it to represent.

I might be sympathetic to this, except that keyboard layouts and input (esp. on mobile devices) are an even bigger mess and even more fragmented than character encoding. Furthermore, while keys are not a 1:1 mapping with Unicode codepoints, they are very strongly influenced by the defined codepoints.

It'd be nice to separate those two components cleanly, but since language is defined by how it's used, this abstraction is always going to be very porous.

> Just out of curiosity - I would be interested to know more about your learning experience that you feel is not well aligned with the representation of jophola as it is currently.

I have literally never once heard the jophola referred to as a viram and a য, except in contexts such as this one. Especially since the jophola creates a vowel sound (instead of the consonant য), and especially since the jophola isn't even pronounced like "y". (I understand why the jophola is pronounced the way it is, but arguing on the basis of phonetic Sanskrit is a poor representation of Bengali today - by that point, we might as well be arguing that the thorn[0] is equivalent to "th" today, or that æ should be a separate letter and not a diphthong[1])

I'm not going to claim that it's completely without pre-digital precedent, but it certainly is not universal, and it's inconsistent in one of the above ways no matter how one slices it, especially when looking at some incredibly obscure and/or antiquated modifiers that are given their own characters, despite undeniably being composed of other characters that are already Unicode codepoints[2].

(Out of curiosity, where in Bengal are you from?)

[0] https://en.wikipedia.org/wiki/Thorn_%28letter%29

[1] As it was in Old English.

[2] Such as the ash (æ)!


Not necessarily disagreeing with your broader point, but I just want to point out that the examples are only obscure and antiquated in English. æ is common in modern Danish and unambiguously a separate letter, as is þ in Icelandic.


And Norwegian (for æ). We have æ/Æ and ø/Ø (distinct, in theory, from the symbol for the empty set, btw), while å/Å used to be written aa/AA a long time ago (but is obviously not a result of combining two a's). Swedish uses ö/Ö for essentially ø/Ø, and ä/Ä for æ/Æ. Both of those I can only easily type by combining the dot-dot with o/O, a/A, because my Norwegian keyboard layout has keys for/labelled øæå, not öäå.

For Japanese (and Chinese and a few others) things are even more complicated. It's tricky to fit ~5000 symbols on a keyboard, so typically in Japan one types on either a phonetic layout or a Latin layout, and translates to Kanji as needed (eg: "nihongo" or "にほんご" is transformed to "日本語" -- note also that "ご" itself is a compound character, "ko" modified by two dots to become "go" -- which may or may not be entered as a compound, with a modifier key).

As I currently don't have any Japanese input installed under xorg, that last bit I had to cut and paste.

It is entirely valid to view ø as a combination of o and a (short) slash, or å as a combination of "a" and "°" -- but if one does that while typing, it is important that software handles the compound correctly (and distinct from ligatures, as mentioned above). My brother's name, "Ståle" is five letters/symbols long, if reversed it becomes "elåtS", not "el°atS" (six symbols).
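
As a rough sketch of what that means in code (Python here; fully general handling would need a grapheme-cluster-aware library, and this particular fix only works because å has a precomposed form):

    import unicodedata

    s = "Sta\u030ale"                  # "Ståle" typed as a + COMBINING RING ABOVE
    print("".join(reversed(s)))        # naive reversal puts the ring over the wrong letter
    nfc = unicodedata.normalize("NFC", s)
    print("".join(reversed(nfc)))      # 'elåtS' - correct, because å collapses to one codepoint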

So, yeah, it's complicated. Remember that we've spent many years fighting the hack that was ASCII and extended ASCII (which may be (part of) why eg: Norwegian gets to have å rather than a+°). You still can't easily use UTF-8 with either C or, as I understand it, C++ (almost, but not quite -- AFAIK one easy workaround is to use Qt's strings if one can have Qt as a dependency -- and it's still a mess on Windows, due to their botched wide char hacks... etc).

All in all, while it's nice to think that one can take some modernized, English-centric ideas evolved from the Gutenberg press and mash them together with a notion of what constitutes a "letter" (How hard is it to reverse a string!? How hard is it to count letters!? How hard is it to count words?!) -- that approach is simply wrong.

There will always be magic, and there'll be very few things that can be said with confidence to be valid across all locales. What is to_upper("日本語"), reverse("Ståle"), character_count("Ståle"), word_count("日本語"), etc.?

This turned into a bit more of an essay than I intended, sorry about that :-)


To be fair, proper codepoint processing is a pain even in Java, which was created back when Unicode was a 16-bit code. Now that codepoints extend beyond 16 bits (so Java's UTF-16 strings need surrogate pairs), proper Unicode string looping looks something like this:

    for(int i = 0; i < string.length();) {
        final int codepoint = string.codePointAt(i);
        i += Character.charCount(codepoint);
    }


Actually, that's not correct, and it's the exact same mistake I made when using that API. codePointAt returns the codepoint at index i, where i is measured in 16-bit chars, which means you could index into the middle of a surrogate pair.

The correct version is:

  for (int i = 0; i < string.length(); i = string.offsetByCodePoints(i, 1))
  {
     int codepoint = string.codePointAt(i);
  }

Java 8 seems to have acquired a codePoints() method on the CharSequence interface, which seems to do the same thing.

But this just reinforces the point: proper Unicode string processing is a pain :).


I think you missed the part where `i` is not incremented in the for statement, but inside the loop using `Character.charCount`, which returns the number of `char` necessary to represent the code point. If there's something wrong with this, my unit tests have never brought it up, and I am always sure to test with multi-`char` codepoints.


You're right, I did miss it; I apologize. That'll teach me to read code so early in the morning.


> except that keyboard layouts and input (esp. on mobile devices) are an even bigger mess and even more fragmented than character encoding.

Encoding was in a similar place 10-15 years ago. Almost every publisher in Bengali had their own encoding, font, and keyboard layout - the bigger ones built their own in-house systems, while the smaller ones used systems that were built or maintained by very small operators. To make things even more complicated, these systems needed a very specific combination of operating system and page layout software to work. Now the situation is much better, with most publishers switching to Unicode, at least for public-facing content.

With input methods, I expect to see at least some consolidation - I don't necessarily think we need standards here, but there will be clear leaders that emerge. Yes, keyboard layouts are influenced by Unicode code-points, but only in a specific context. Usually when people who already have experience with computers start to type in Bengali (or any other Indic language), they use a phonetic keyboard, which is influenced mostly by the QWERTY layout. Then, if they write a significant amount, they find that the phonetic input is not very efficient (typing kha every time to get খ is painful), and they switch to a system where there's a one-to-one mapping between commonly used characters and keys. This does tend to have a relationship between defined codepoints and keys, but that's probably because the defined codepoints cover the basic characters in the script (so in your case, ্য would need to have a separate key, which I think is fine). There will still be awkward gestures, but that's, again, a part of adjusting to the new medium. No one bats an eyelid when hitting "enter" to get a newline - but when we learn to write on paper, we never encounter the concept of a carriage return.

> I have literally never once heard the jophola referred to as a viram and a য

Interesting - I guess we have somewhat different mental models. For me, I did think of jophola as a "hoshonto + jo", possibly because of the "jo" connection, and this was true even before I started to mess around with computers or Unicode. I always thought about jophola as a "yuktakshar", and if it's a "yuktakshar", I always mentally broke it down to its constituents.

> [...] especially when looking at some incredibly obscure and/or antiquated modifiers that are given their own characters

I think those exist because of backwards compatibility reasons. For Bengali I think Unicode made the right choice to start with the minimum number of code points (based on what ISCII had at that time). As others have pointed out elsewhere in the thread - it is an evolving standard, and additions are possible. Khanda-ta did get accepted, and contrary to what many think, non-consortium members can provide their input (for example, I am acknowledged in the khanda-ta document I linked to earlier, and all I did was participate in the mailing list and provide my suggestions and some evidence).

> Out of curiosity, where in Bengal are you from?

কলকাতা


This has been a fascinating back and forth. Thank you for taking the time to have a comprehensive discussion.


> Is Bengali your first language?

A better question is: are there any native Bengali speakers creating character set standards in Bangladesh or India? If not, why not? If so, did they omit your character?

I ask, because although you prefer to follow the orthodox pattern of blaming white racism for your grievance du jour, the policy of the Unicode Technical Committee for years has been to use the national standards created by the national standards bodies where these scripts are most used as their most important input.

Twenty years ago, I spent a lot of time in these UTC meetings, and when the question arose as to whether to incorporate X-Script into the standard yet, the answer was never whether these cultural imperialists valued, say, Western science fiction fans over irrelevant foreigners, but it was always, "What is the status of X-Script standardization in X-land?" Someone would then report on it. If there was a solid, national standard in place, well-used by local native speakers in local IT applications, it would be fast-tracked into Unicode with little to no modification after verification with the national authorities that they weren't on the verge of changing it. If, however, there was no official, local standard, or several conflicting standards, or a local standard that local IT people had to patch and work around, or whatever, X-Script would be put on a back burner until the local experts figured out their own needs and committed to them.

The complaint in this silly article about tiny Klingon being included before a complete Bengali is precisely because getting Bengali right was more complex and far more important. Apparently, the Bengali experts have not yet established a national standard that is clear, widely implemented, agreed upon by Bengali speakers and that includes the character the author wants in the form he/she wants it, for which he/she inevitably blames "mostly white men."

(Edited to say "he/she", since I don't know which.)


I mostly agree with your point, but note that the author is male (well, the name is a commonly male one).

It's a bit telling that folks in the software industry[1] seem to assume that techies are male (a priori), but those who write articles of this kind are female.

Not blaming you for it, but it's something you should try to be conscious about and fix.

[1] I've been guilty of this myself, though usually in cases where I use terms like "guys" where I shouldn't be.


I had a female coworker by that name, so your assumption that I just assume that people who write articles like this are female and need to have my consciousness raised to "fix" my unconscious sexism is something you should try to be more conscious of and try to fix.

However, I clearly do need to question my assumption that since this was a female name before, it's a female name now, so I should change it to "he/she".


Or they.


> I had a female coworker by that name

Oh, sorry about that. Not sure if you're joking about the assumption of assumptions, but asking people to take note of their behavior based on something that they _might_ have assumed is not dangerous. Assuming gender roles is. Apologies for making that assumption, but IMO it's a rather harmless one so I don't see anything to fix about it :P


> The complaint in this silly article about tiny Klingon being included before a complete Bengali is precisely because getting Bengali right was more complex and far more important.

This is factually incorrect. It seems you missed both the factual point about the Klingon script in the article as well as the broader point which that detail was meant to illustrate.

> although you prefer to follow the orthodox pattern of blaming white racism for your grievance du jour, the policy of the Unicode Technical Committee for years has been to use the national standards created by the national standards bodies where these scripts are most used as their most important input.

There's a huge difference between piggybacking off of a decades-old proposed scheme which was never widely adopted even in its country of origin, and which was created under a very different set of constraints and goals than Unicode, versus making native speakers an active and equal part of the actual decision-making process.

Rather than trying to shoehorn the article into a familiar pattern which doesn't actually fit ("orthodox pattern of blaming white racism for your grievance du jour"), please take note that the argument in the article is more nuanced than you're giving it credit for.


> versus making native speakers an active and equal part of the actual decision-making process.

As I explained, native speakers are the primary decision makers, and not just any native speakers but whoever the native speakers choose as their own top, native experts when they establish their own national standard. For living, natural languages, you don't get characters into Unicode by buying a seat on the committee and voting for them. You do it by getting those characters into a national standard created by the native-speaking authorities.

So, I repeat: What national standard have your native-speaking authorities created that reflects the choices you claim all native speakers would naturally make if only the foreign oppressors would listen to them? If your answer is that the national standards differ from what you want, then you are blaming the Unicode Technical Committee for refusing to override the native speakers' chosen authorities and claiming this constitutes abuse of native Bengali speakers by a bunch of "mostly white men".


> As I explained, native speakers are the primary decision makers

No, the ultimate decision makers of Unicode are the voting members of the Unicode Consortium (and its committees).

> For living, natural languages, you don't get characters into Unicode by buying a seat on the committee and voting for them. You do it by getting those characters into a national standard created by the native-speaking authorities

As referenced elsewhere in the comments, there are plenty of decisions that the Unicode Consortium (and its committees) take themselves. Some of these (though not all) take "native-speaking authorities" as an input, but the final decision is ultimately theirs.

There's a very important difference between being made an adviser (having "input") and being a decision-maker, and however much the decision-makers may value the advisers, we can't pretend that those are the same thing.


You claim that native Bengali speakers on the UTC would have designed the character set your way, the real native speaker way, instead of the bad design produced by these "mostly white men".

But the character set WAS designed by native speakers, by experts chosen not by the UTC but by the native speaking authorities themselves. The UTC merely verified that these native speaking experts were still satisfied with their own standard after using it for a while, and when they said they were, the UTC adopted it.

You go on about how the real issue is the authority of these white men and how the native speakers are restricted to a minor role as mere advisers, and yet the native speakers, as is usually the case, had all the authority they needed to create the exact character set that THEY wanted and get it adopted into Unicode. That's the way the UTC wants to use its authority in almost all cases of living languages.

Unfortunately for your argument, these native speakers didn't need any more authority to get the character set they wanted into Unicode. They got it. You just don't like their choices, but you prefer to blame it on white men with authority.


It seems to me that the high-level issue here is that Unicode is caught between people who want it to be a set of alphabets, and people who want it to be a set of graphemes.

The former group would give each "semantic character" its own codepoint, even when that character is "mappable" to a character in another language that has the same "purpose" and is always represented with the same grapheme (see, for example, latin "a" vs. japanese full-width "a", or duplicate ideograph sets between the CJK languages.) In extremis, each language would be its own "namespace", and a codepoint would effectively be described canonically as a {language, offset} pair.

The latter group, meanwhile, would just have Unicode as a bag of graphemes, consolidated so that there's only one "a" that all languages that want an "a" share, and where complex "characters" (ideographs, for example, but what we're talking about here is another) are composed as ligatures from atomic "radical" graphemes.

I'm not sure that either group is right, but trying to do both at once, as Unicode is doing, is definitely wrong. Pick whichever, but you have to pick.


Unicode makes extensive use of combining characters for European languages, for example to produce diacritics such as ì or ǒ, or even for flag emoji. A correct rendering system will properly combine those, and if it doesn't, that's a flaw in the implementation, not the standard. It seems like you're trying to single out combining pairs as "less legitimate" when they're extensively used in the standard.
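
To illustrate (a small Python sketch; how it looks on screen depends on your font and text stack): the same visible character can be a single precomposed codepoint or a base letter plus a combining mark, and flag emoji have no precomposed form at all:

    s1 = "\u00ec"                    # ì as one precomposed codepoint
    s2 = "i\u0300"                   # i + COMBINING GRAVE ACCENT
    print(s1, s2)                    # both should render identically on a correct text stack
    flag = "\U0001f1ef\U0001f1f5"    # two REGIONAL INDICATOR symbols, rendered as the JP flag
    print(flag)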


> Unicode makes extensive use of combining characters for European languages, for example to produce diacritics such as ì or ǒ, or even for flag emoji.

But it doesn't, for example say that a lowercase "b" is simply "a lowercase 'l' followed by an 'o' followed by an invisible joiner", because no native English speaker thinks of the character "b" as even remotely related to "lo" when reading and writing.

> It seems like you're trying to single out combining pairs as "less legitimate" when they're extensively used in the standard.

I'm saying that Unicode only does it in English where it makes semantic sense to a native English speaker. It does it in Bengali even where it makes little or no semantic sense to a native Bengali speaker.


> > It seems like you're trying to single out combining pairs as "less legitimate" when they're extensively used in the standard.

> I'm saying that Unicode only does it in English where it makes semantic sense to a native English speaker.

Well, combining characters almost never come up in English. The best I can think of would be the use of cedillas, diaereses, and grave accents in words like façade, coördinate, and renownèd (I've been reading Tolkien's translation of Beowulf, and he used renownèd a lot).

Thinking about the Spanish I learned in high school, ch, ll, ñ, and rr are all considered separate letters (i.e., the Spanish alphabet has 30 letters; ch is between c and d, ll is between l and m, ñ is between n and o, and rr is between r and s; interestingly, accented vowels aren't separate letters). Unicode does not provide code points for ch, ll, or rr; and ñ has a code point more from historical accident than anything (the decision to start with Latin1). Then again, I don't think Spanish keyboards have separate keys for ch, ll, or rr.

Portuguese, on the other hand, doesn't officially include k or y in the alphabet. But it uses far more accents than Spanish. So, a, ã and á are all the same letter. In a perfect world, how would Unicode handle this? Either they accept the Spanish view of the world, or the Portuguese view. Or, perhaps, they make a big deal about not worrying about languages and instead worrying about alphabets ( http://www.unicode.org/faq/basic_q.html#4 ).

They haven't been perfect. And they've certainly changed their approach over time. And I suspect they're including emoji to appear more welcoming to Japanese teenagers than they were in the past. But (1) combining characters aren't second-class citizens, and (2) the standard is still open to revisions ( http://www.unicode.org/alloc/Pipeline.html ).


Spanish speaker here. "ch" and "ll" being separate letters has been discussed for a long time and finally the decision was that they weren't separate letters but a combination of two [1]. Meanwhile, "ñ" stands as a letter of its own.

Accented vowels aren't considered different letters in Spanish because they affect the word they are in rather than the letter, as they serve to indicate which syllable in a word is the "strong" one. From a Spanish point of view, "a" and "á" are exactly the same letter.

[1] http://www.rae.es/consultas/exclusion-de-ch-y-ll-del-abeceda...


That's news to me. Perhaps I'll have better luck finding words like "chancho" in a dictionary; I'll be right to look in the c's!


I'm coming from a German background and I sympathize with the author.

German has 4 (7 if you consider cases) non-ASCII characters: äüöß(and upper-case umlauts). All of these are unique, well-defined codepoints.

That's not related to composing on a keyboard. In fact, although I'm German, I'm using the US keyboard layout and HAD to compose these characters now. But I wouldn't need to, and the result is a single codepoint again.


> German has 4 (7 if you consider cases) non-ASCII characters: äüöß(and upper-case umlauts). All of these are unique, well-defined codepoints.

German does not consider "ä", "ö" and "ü" letters. Our alphabet has 26 letters, none of which are the ones you mentioned. In fact, if you go back in history it becomes even clearer that those letters used to be ligatures in writing.

They are still collated as the basic letters they represent, even if they sound different. That we usually use the precomposed representation in Unicode is merely a historical artifact because of ISO-8859-1 and others, not because it logically makes sense.

When you used an old typewriter you usually did not have those keys either, you composed them.


One by one:

I'm confused by your use of 'our' and 'we'. It seems you're trying to write from the general point of view of a German, answering .. a German?

Are umlauts letters? Yes. [1] [2] Maybe not the best source, but please provide a better one if you disagree so that I can actually understand where you're coming from.

I understand - I hope? - composition. And I tend to agree that it shouldn't matter much if the input just works. If I press a key labeled ü and that letter shows up on the screen, I shouldn't really care if that is one codepoint or a composition of two (or more). I do think that the history you mention is an indicator that supports the author's argument. There IS a codepoint for ü (painful to type..). For 'legacy reasons' perhaps. And it feels to me that non-ASCII characters - for legacy reasons or whatever - have better support than the ones he is complaining about, if they originate in western Europe/in my home country.

Typewriters and umlauts:

http://i.ebayimg.com/00/s/Mzk2WDQwMA==/$T2eC16N,!)sE9swmYlFP...

(Basically I searched for old typewriter models; 'Adler Schreibmaschinen' results in lots of hits like that. Note the separate umlaut keys. And these are typewriters from... the 60s? Maybe?)

1: https://de.wikipedia.org/wiki/Alphabet 2: https://de.wikipedia.org/wiki/Deutsches_Alphabet


I am not entirely sure if Germans count umlauts as distinct characters or modified versions of the base character. And maybe it is not so important; they still do deserve their own code points.

Note BTW that in e.g. Swedish and German alphabets, there are some overlapping non-ASCII characters (ä, ö) and some that are distinct to each language (å, ü). It is important that the Swedish ä and German ä are rendered to the same code point and same representation in files; this way I can use a computer localised for Swedish and type German text. Only when I need to type ü I need to compose it from ¨ and u, while ä and ö are right on the keyboard.

The German alphabetical order supports the idea that umlauts are not so distinct from their bases: it is

AÄBCDEFGHIJKLMNOÖPQRSßTUÜVWXYZ

while the Swedish/Finnish one is

ABCDEFGHIJKLMNOPQRSTUVWXYZÅÄÖ

This has the obvious impacts on sorting order.

BTW, traditionally Swedish/Finnish did not distinguish between V and W in sorting, thus a correct sorting order would be

Vasa

Westerlund

Vinberg

Vårdö

- the W drops right in the middle, it's just an older way to write V. And Vå... is at the end of section V, while Va... is at the start.


Umlauts are not distinct characters, but modifications of existing ones to indicate a sound shift.

http://en.wikipedia.org/wiki/Diaeresis_%28diacritic%29

German has valid transcriptions to their base alphabet for those, e.g "Schreoder" is a valid way to write "Schröder".

ß, however, is a separate character that is not listed in the German alphabet, especially because some subgroups don't use it (e.g. Swiss German doesn't have it).


A few things:

1) To avoid confusing readers who don't know German or aren't used to umlauts: the correct transcription is base vowel + e (i.e. ö turns into oe) - the example given is therefore wrong. Probably just a typo, but still.

2) These transcriptions are lossy. If you see 'oe' in a word, you cannot (always) pronounce it as an umlaut. The second e might just indicate that the o in oe is long.

3) ß is a character in the alphabet, as far as I'm aware and as far as the mighty Wikipedia is concerned, as I pointed out above. If you have better sources that claim something else, please share those (I .. am a native speaker, but no language expert. So I'm genuinely curious why you'd think that this letter isn't part of the alphabet).

Fun fact: I once had to revise all the documentation for a project, because the (huge, state-owned) Swiss customer refused perfectly valid German, stating "We don't have that letter here, we don't use it: Remove it".


1) It's a typo, yes. Thanks! 2) Well, they are lossy in the sense that pronunciation is context-sensitive. The number of cases where you actually turn the word into another word is very small: http://snowball.tartarus.org/algorithms/german2/stemmer.html has a discussion. 3) You are right, I'm wrong. ß, ä, ö, ü are considered part of the alphabet. It's not taught in school, though (at least it wasn't in mine).

Thanks a lot for making the effort and fact-checking better than I did there :).


Yes, that transcription approach is familiar; here the result of German-Swedish-Finnish equivalency of "ä" is sometimes not so good.

For instance, in skiing competitions, the start lists are for some reason made with transcriptions to ASCII. It's quite okay that Schröder becomes Schroeder, but it is less desirable that Söderström becomes Soederstroem and quite infuriating that Hämäläinen becomes Haemaelaeinen. We'd like it to be Hamalainen, just drop the dots.


Well, they have codepoints, but not unique ones (since they can be written either using combining characters or using the compatibility pre-combined form). Software libraries dealing with Unicode strings need to handle both versions, by applying Unicode normalization before doing comparisons.

The reason they have two representations is backwards compatibility with previous character encoding standards, but the Unicode standard is more complex because of this (it needs to specify more equivalences for normalization). I guess for languages which were not previously covered by any standards, the Unicode Consortium tries to represent things "as uniquely as possible".
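
A minimal sketch of the two representations and the normalization step (Python's stdlib unicodedata here; other libraries expose the same normalization forms):

    import unicodedata

    precomposed = "\u00e4"    # ä as a single codepoint
    combining = "a\u0308"     # a + COMBINING DIAERESIS
    print(precomposed == combining)                                # False: different codepoint sequences
    print(unicodedata.normalize("NFC", combining) == precomposed)  # True after canonical normalization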


> But I wouldn't need to, and the result is a single codepoint again.

Doesn't have to be though, it'd be perfectly correct for an IME to generate multiple codepoints. IIRC, that's what you'd get if you typed those in a filename on OSX then asked for the native file path, as HFS+ stores filenames in NFD. Meanwhile Safari does (used to do?) the opposite, text is automatically NFC'd before sending. Things get interesting when you don't expect it and don't do unicode-equivalent comparisons.


8 letters actually. 'ẞ' was added quite a while later.


Agreed, it exists. But then again, most systems in use today (as far as I'm aware) would turn a ß into SS, not ẞ.

Actually I think I've never seen a ẞ in use, ever. Not once.

Now I'm running around testing 'Try $programmingLanguage' services on the net. Try Clojure for example:

    > (.toUpperCase "ß")
    "SS"


In Haskell, `isLower $ toUpper 'ß'` is True. I wonder how many security holes this unexpected behaviour causes.


.Net seems to do the same thing, Javascript (according to jsfiddle) as well. So maybe this is more widespread than I thought (again - I have never seen that character in the wild)?

Java (as in Try Clojure) seems to do the 'expected' SS thing. Trying the golang playground I get even worse:

    fmt.Println(strings.ToUpper("ßẞ"))

returns

    ßẞ

(yeah, unchanged?)

So, while I agree that you're technically correct (ẞ exists!), I'll stick to my ~7-letter list for now. It seems that's both realistic and usable.


I think this is more related to the fact that there aren't many sane libraries implementing Unicode and locales -- so you'll get either some C lib/C++ lib, system lib, or Java lib -- or an actual new implementation that's been done "seriously", as part of being able to say: "Yes, X does actually support Unicode strings."

Python 3 got a lot of flak for the decision to break away from byte sequences and make its string type a Unicode string. But I think that was the right choice. I can still understand why people writing software that only cares about network, on-the-wire, pretend-to-be-text strings were annoyed.

Then again, based on some other comments here, apparently there are still some dark corners:

    Python 3.2.3 (default, Feb 20 2013, 14:44:27) 
    [GCC 4.7.2] on linux2
    >>> s="Åßẞ"
    >>> s == s.upper().lower()
    False
    >>> s.lower()
    'åßß'

However, to complicate things:

    Python 3.4.2 (default, Dec 27 2014, 13:16:08)
    [GCC 4.9.2] on linux
    >>> s="Åßẞ"
    >>> s.lower()
    'åßß'
    >>> s.lower().upper()
    'ÅSSSS'
    >>> s == s.lower().upper()
    False
    >>> s.lower().upper() == 'ÅSSSS'
    True
    >>> 'SS'.lower()
    'ss'
    >>> 'ß'.lower()
    'ß'
    >>> 'ß'.lower().upper()
    'SS'
    >>> 'ß'.lower().upper().lower()
    'ss'

So that's fun.


Thanks for pointing that out -- I was vaguely aware 3.2 wasn't good (but pypy still isn't up to 3.4?) -- it's what's (still) in Debian stable as python3 though. Jessie (soonish to be released) will have 3.4 though, so at that point python3 should really start to be viable (to the extent that there are differences that actually are important...).

For the record, .casefold():

    #Python 3.4:
    >>> 'Åßẞ'.casefold() == 'åßß'.casefold() == 'åssss'
    True
[ed: Also, wrt upper/lower being for display purposes -- I thought it was nice to point out that they are not symmetric, as one might expect them to be (although that expectation is probably wrong in the first place...)]


FWIW,

- 3.2 is considered broken with a narrow unicode build (although it doesn't matter here)

- .lower and .upper are primarily for display purposes

- .casefold is for caseless matching


> Portuguese, on the other hand, doesn't officially include k or y in the alphabet.

With no judgement towards your broader point, I'd like to point out that this is no longer the case as of the orthographic agreement of 1990[0].

As far as I know it's been added back in order to better suit African speakers.

[0] https://pt.wikipedia.org/wiki/Acordo_Ortogr%C3%A1fico_de_199...


That's good to know. I learned Portuguese in '97-'99, so the information I had was incorrect at the time. We Americans always recited the alphabet with k and y, but our teacher said they weren't official (although he also said that Brazilians would recognize them).


> ch, ll, ñ, and rr are all considered separate letters

In Spanish "rr" has never been considered as a single letter. "Ch" and "ll" used to be, but not anymore. Ñ is, of course.


I think I'm older than you. I learnt in school that they were different letters, and I also remember when they were removed at the beginning of the 90's.


That's funny; I spent several hours of class time trilling r's to make sure we pronounced "carro" correctly, and repeating a 30 character alphabet.


rr not being its own letter has no bearing on whether you can pronounce carro correctly, just like saying church right has no bearing on whether c and h are two letters or ch is a single letter.


What can I say? Apparently the textbook was wrong.


I'm afraid that I have to add my voice to the list of people raised in Spanish speaking countries prior to the 90s who was VERY clearly taught that rr was a separate letter.

This is what I recall from my childhood: http://img.docstoccdn.com/thumb/orig/113964108.png


In your example... I wouldn't really care how it is stored, as long as it looks right on the display, and I don't have to go through contortions to enter it on an input device... for example, I don't care that 'a' maps to \x61... it's a value behind the scenes; what matters is the interface to that value.

As long as the typeface/font used can display the character/combination reasonably, and I can input it reasonably, it doesn't matter so much how it's stored...

Now, having to type in 2-3 characters to get there, that's a different story, and one that should involve better input devices in most cases.


> I wouldn't really care how it is stored, as long as it looks right on the display

It becomes a problem when you have other uses besides reading text, such as sorting or searching.


That's why you can normalize the input to a canonical Unicode form for this purpose. The same should go for passwords before hashing them.
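
Something like this minimal sketch (Python; the function name and the bare SHA-256 are just for illustration - real password storage should use a proper KDF such as bcrypt or argon2):

    import hashlib
    import unicodedata

    def hash_password(password: str) -> str:
        # Normalize first, so "é" entered as one codepoint or as e + combining accent
        # hashes to the same value either way.
        canonical = unicodedata.normalize("NFC", password)
        return hashlib.sha256(canonical.encode("utf-8")).hexdigest()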


Except Unicode also has individual codepoints for "ǒ" and "ì".


(I can't read Bengali, so I'm not entirely sure what the jophola is, but I'm trying to relate this to Devanagari -- if my analogy is mistaken or if you don't know Devanagari, let me know)

I don't see anything discriminatory about not giving glyphs their own codepoints. Devanagari has tons of glyphs which are logically broken up into consonants, modifying diacritics, and bare vowels. Do we really need separate codepoints for these[1] when they can be made by combinations of these[2]?
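
To make that concrete, a tiny illustrative sketch (Python) of one such conjunct - क्ष has no codepoint of its own and is encoded as three:

    import unicodedata

    ksha = "\u0915\u094d\u0937"    # क + virama + ष, drawn as the single conjunct glyph क्ष
    for ch in ksha:
        print(f"U+{ord(ch):04X}  {unicodedata.name(ch)}")

    # U+0915  DEVANAGARI LETTER KA
    # U+094D  DEVANAGARI SIGN VIRAMA
    # U+0937  DEVANAGARI LETTER SSA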

I mentioned this as a reply to another comment, but it's only a Unicode problem if:

- There is no way to write the glyph as a combination of code points

- There is a way to write the glyph as a combination of code points, but the same combination could mean something else (not counting any rendering mistakes, the question is about if Unicode defines it to mean something else)

If it's hard to input, that's the fault of the input method. If it doesn't combine right, it's the fault of the font.

[1]: http://en.wikipedia.org/wiki/Devanagari#Biconsonantal_conjun... [2]: http://en.wikipedia.org/wiki/Devanagari_%28Unicode_block%29


> whereas the characters that are required to type a jophola have no obvious semantic, phonetic, or orthographic connection to the jophola.

Then it is an input method issue not an encoding issue.


Doesn't this reply invalidate your whole point of the article?

Seems like there was a lot of hard work put in on making the "khanda-ta" work properly?


> For me, many of these problems are more of an input issue than an encoding issue.

I think you've hit the nail on the head here. I'm a native English speaker, so I may in fact be making bad assumptions here, but I think the biggest issue here is that people conflate text input systems with text encoding systems. Unicode is all about representing written text in a way that computers can understand. But the way that a given piece of text is represented in Unicode bears only a loose relation to the way the text is entered by the user, and the way it's rendered to screen. A failure at the text input level (can't input text the way you expect), or a failure at the text rendering level (text as written doesn't render the way you expect), are both quite distinct from a failure of Unicode to accurately represent text.


They're not unrelated though. You have to have a way to get from your input format to the finished product in a consistent way, and the glyph set you design has a large bearing on that. You can't solve it completely with AI, because then you just have an AI interpretation of human language, not human language. A language like Korean written in Hangul would need to create individual glyphs from smaller ones through the use of ligatures, but a similar approach couldn't be taken to Japanese, since many glyphs have multiple meanings depending on context. How should these be represented in Unicode? Yes, these are likely solved problems, but I'm sure there are other examples of less-prominent languages that have similar problems but nobody's put in the work to solve them because the languages aren't as popular online.

You need to be able to represent the language at all stages of authorship - i.e. Unicode needs to be able to represent a half-written Japanese word somehow (yes, Japanese is a bad example because it has a phonetic alphabet as well as a pictograph alphabet).

Anyway, trying to figure out a single text encoding scheme capable of representing every language on Earth is not an easy task.


It's not an AI issue, just a small matter of having lots of rules. Moreover, this is not just an issue for non-Western languages: the character â (lower case "a" with a circumflex) can be represented either as a single code point U+00E2 or as an "a" combined with a combining circumflex. Furthermore, Unicode implementations are required to evaluate these two versions as being equal in string comparisons, so if you search for the combined version in a document, it should find the single code point instances as well.


> Unicode implementations are required to evaluate these two versions as being equal in string comparisons

What do you mean by "required"? There are different forms of string equality. It's plausible to have string equality that compares the actual codepoint sequence, vs string equality that compares NFC or NFD forms, and there's string equality that compares NFKC or NFKD forms. And heck, there's also comparing strings while ignoring diacritics.

Any well-behaving software that's operating on user text should indeed do something other than just comparing the codepoints. In the case of searching a document, it's reasonable to do a diacritic-insensitive search, so if you search for "e" you could find "é" and "ê". But that's not true of all cases.
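
For example, a rough sketch of a diacritic-insensitive comparison (Python; a real search feature would more likely lean on a collation library such as ICU):

    import unicodedata

    def strip_marks(text: str) -> str:
        # NFD separates base letters from their combining marks, which we then drop.
        decomposed = unicodedata.normalize("NFD", text)
        return "".join(c for c in decomposed if not unicodedata.combining(c))

    print(strip_marks("é"), strip_marks("ê"))   # 'e' 'e' - both would match a search for "e"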


It's part of the Unicode standard. See http://en.wikipedia.org/wiki/Unicode_equivalence for details.

(OK, so "required" might be overstating it; you are perfectly free to write a program that doesn't conform to the standard. But most people will consider that a bug unless there is a good reason for it)


Unicode defines equivalence relations, yes. But nowhere is a program that uses Unicode required to use an equivalence relation whenever it wishes to compare two strings. It probably should use one, but there are various reasons why it might want strict equality for certain operations.


In some languages those accented characters would be different letters, sometimes appearing far away from each other in collation order. In other cases they are basically the same letter. Whereas in Hungarian 'dzs' is a letter.


Different languages can define different collation rules even when they use the same graphemes. For example, in Swedish z < ö, but in German ö < z. Same graphemes, different collation.
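
A rough sketch of that difference using Python's locale module (assuming the named locales are installed on the system; ICU would be the more robust choice):

    import locale

    words = ["zebra", "öl"]

    locale.setlocale(locale.LC_COLLATE, "sv_SE.UTF-8")    # Swedish: z sorts before ö
    print(sorted(words, key=locale.strxfrm))               # ['zebra', 'öl']

    locale.setlocale(locale.LC_COLLATE, "de_DE.UTF-8")    # German: ö collates next to o, before z
    print(sorted(words, key=locale.strxfrm))               # ['öl', 'zebra']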


And we may even have more than one set of collation rules within the same language.

E.g. Norwegian had two common ways of collating æ,ø,å and their alternative forms ae, oe and aa. Phone books used to collate "ae" with æ, "oe" with ø and "aa" with å, while in other contexts "ae", "oe" and "aa" would often be collated based on their constituent parts. It's a lot less common these days for the pairs to be collated with æøå, but still not unheard of.

Of course it truly becomes entertaining to try to sort things out when mixing in "foreign" characters. E.g. I would be inclined to collate ö together with ø if collating predominantly Norwegian strings, since ö used to be fairly commonly used in Norway too, but these days you might also find it collated with "o".


Why does Unicode need to represent a half-written Japanese word? If it's half-written, you're still in the process of writing it, and this is entirely the domain of your text input system.

Which is to say, there is absolutely no need for the text input system to represent all stages of input as Unicode. It is free to represent the input however it chooses to do so, and only produce Unicode when each written unit is "committed", so to speak. To demonstrate why this is true, take the example of a handwriting recognition input system. It's obviously impossible to represent a half-written character in Unicode. It's a drawing! When the text input system is confident it knows what character is being drawn, then it can convert that into text and "commit" it to the document (or the text field, or whatever you're typing in).

But there's nothing special about drawing. You can have fancy text input systems with a keyboard that have intermediate input stages that represent half-written glyphs/words. In fact, that's basically what the OS X text input system does. I'm not a Japanese speaker, so I actually don't know whether all the intermediate forms that the Japanese input systems (there are multiple of them) go through have Unicode representations, but the text input system certainly has a distinction between text that is being authored and text that has been "committed" to the document (which is to say, glyphs that are in their final form and will not be changed by subsequent typed letters). And I'm pretty sure the software that you're typing in doesn't know about the typed characters until they're "committed".

Edit: In fact, you can even see this input system at work in the US English keyboard. In OS X, with the US English keyboard, if you type Option-E, it draws the ACUTE ACCENT glyph (´) with a yellow background. This is a transitional input form, because it's waiting for the user to type another character. If the user types a vowel, it produces the appropriate combined character (e.g. if the user types "e" it produces "é"). If the user types something else, it produces U+00B4 ACUTE ACCENT followed by the typed character.


> Why does Unicode need to represent a half-written Japanese word? If it's half-written, you're still in the process of writing it, and this is entirely the domain of your text input system.

Drafts or word documents (which are immensely simpler if just stored as Unicode). Then there's the fact that people occasionally do funky things with kanji, so you're doing everyone a favour by letting them half-write a word anyway.


Interestingly enough, one early Apple II-era input method for Chinese would generate the characters on the fly (since the hardware at the time couldn't handle storing, searching and rendering massive fonts) meaning it could generate partial Chinese characters or ones that didn't actually exist.

http://en.wikipedia.org/wiki/Cangjie_input_method#Early_Cang...

In the Wikipedia article it even shows an example of a rare character that's not encoded in Unicode but which can be represented using this method.


FWIW there is an interesting project called Swarachakra[1] which tries to create a layered keyboard layout for Indic languages (for mobile) that's intuitive to use. I've used it for Marathi and it's been pretty great.

They also support Bengali, and I bet they would be open to suggestions.

[1]: http://en.wikipedia.org/wiki/Swarachakra



