How would you redo Unicode?

Tofu Type Foundry
Tofu Type Foundry Posts: 50
edited May 6 in Technique and Theory
I’m not really sure how to word this question. If you could “redo” Unicode right now with no issues regarding compatibility in legacy software, what changes would you make? It seems that some of the decisions made decades ago have led to unexpected issues with modern digital typesetting. I’m curious what Unicode would look like in an “optimized” release with all language systems and languages accounted for from the start.

Comments

  • Nick Shinn
    Nick Shinn Posts: 2,265
    For Latin: Abolish quotesingle and quotedbl.
    That might prompt keyboard manufacturers to provide separate keys for all four “curly quotes.” 

    I doubt that having separate code points for quoteright and apostrophe would solve more problems than it would create.
  • John Hudson
    John Hudson Posts: 3,372
    Completely eliminate all decomposable diacritic characters and enforce use of letter + combining mark sequences in all languages.

    Fix Hebrew canonical combining class assignments.

    Consistently assign script properties based on the context in which characters are used, not on the script from which they historically derive (looking at you, not-Greek ᶿ).

    Avoid unification or compatibility decompositions for letter/symbol lookalikes, so e.g. separate codepoints for hooked-f and florin sign, and for lowercase mu and micro sign.

    Provide recommendations for encoding choices for similar and confusable characters, especially for digitally disadvantaged languages.
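    To make concrete what the decomposition and lookalike points above mean in practice, here is a minimal sketch, assuming only Python's standard unicodedata module (nothing here is specific to any one implementation):

        import unicodedata

        precomposed = "\u00E9"    # é as a single codepoint, LATIN SMALL LETTER E WITH ACUTE
        decomposed  = "e\u0301"   # e followed by COMBINING ACUTE ACCENT

        # The two encodings render identically but compare as different strings.
        print(precomposed == decomposed)                                 # False
        print(unicodedata.normalize("NFD", precomposed) == decomposed)   # True after canonical decomposition

        # Lookalike folding: MICRO SIGN carries a compatibility decomposition to Greek mu.
        micro, mu = "\u00B5", "\u03BC"
        print(unicodedata.decomposition(micro))                          # '<compat> 03BC'
        print(unicodedata.normalize("NFKC", micro) == mu)                # True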
  • Igor Freiberger
    Igor Freiberger Posts: 288
    Completely eliminate all decomposable diacritic characters and enforce use of letter + combining mark sequences in all languages.
    Yes, but we would need to define where to place the cedilla in letters like H, K, N, R or turned V. So far I have found no reliable information about this. Have you come across anything new in recent years?
  • John Hudson
    John Hudson Posts: 3,372
    ...we would need to define where to place the cedilla in letters like H, K, N, R or turned V.
    That’s already the case, independent of how such things are encoded. Personally, I am fine with floating the cedilla under the middle of these letters, in the absence of any attested forms in actual use.
  • John Savard
    John Savard Posts: 1,165
    edited May 6
    Completely eliminate all decomposable diacritic characters and enforce use of letter + combining mark sequences in all languages.
    And, of course, if I were re-doing Unicode, I would do exactly the opposite. I would provide the less popular languages, the languages of countries which entered the computer age later, the languages of countries that are less economically powerful, with a full set of pre-composed characters - as are often found in unofficial, unrecognized encodings people have used in those countries - if they are desired.
    Why?
    Because pre-composed characters make it simpler to process text in those languages. They require less processing power, simpler algorithms, and less sophisticated programs.
    But I have to admit, this isn't a no-brainer, even though it seems like a natural consequence of the current desire to provide all peoples with full equality.
    But the only reason a program that handles a given language can be simpler thanks to pre-composed characters is if it handles only the pre-composed versions of those characters. Otherwise, having two alternatives that both need to be handled just makes things more complicated. And that means those programs won't work properly - they won't be compatible with more sophisticated programs that do handle combining mark sequences correctly, which presumably would also exist and would likely be running on the same computers.
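    A minimal sketch of that incompatibility, assuming Python's standard unicodedata module: software that compares raw codepoints sees the two encodings of the same word as different, and only agrees with more sophisticated software once somebody normalizes.

        import unicodedata

        cafe_precomposed = "caf\u00E9"   # é as a single codepoint
        cafe_decomposed  = "cafe\u0301"  # e + combining acute accent

        # A naive program that handles only one form disagrees with one that handles both.
        print(cafe_precomposed == cafe_decomposed)                                 # False
        print(unicodedata.normalize("NFC", cafe_decomposed) == cafe_precomposed)   # True, but only if both sides normalize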
    So I do admit that what I would prefer is seriously flawed.
    Thus, perhaps what I would really want to see is instead for Unicode to be succeeded by two codes - one done the way John Hudson advocates, one done the way I propose, each of these codes being designed to serve a different purpose.
    His successor to Unicode would serve the purpose of being a logical standard for worldwide communications.
    My successor to Unicode would serve the purpose of either serving as a computer code, or being closely related to a computer code or codes, that are well suited to simple and straightforward computation in each particular language.
  • Thomas Phinney
    Thomas Phinney Posts: 2,978
    The one thing that is 100% for sure worse than John Savard’s proposal is his additional proposal to have two encoding standards.

    Good grief, please, no. That way lies madness.
  • John Savard
    John Savard Posts: 1,165
    edited May 7
    The one thing that is 100% for sure worse than John Savard’s proposal is his additional proposal to have two encoding standards.

    Good grief, please, no. That way lies madness.
    Sadly, we've already passed this point.
    The world of standards is already in the grip of that sort of madness.
    Of course, though, my goals can be achieved without having two standards. Add in all the desired precomposed characters for those who need them... but deprecate both them and the existing ones to point modern systems in the better direction.


  • Simon Cozens
    Simon Cozens Posts: 772
    Because pre-composed characters make it simpler to process text in those languages.
    Well, this just isn't true. But even if it were true, think of it the other way around: If even the majority languages had to deal with decomposed characters, software implementers would get them right.

    Doing the complex stuff by default makes things better for minority languages. Trying to turn the processing of minority languages into the same process used for majority languages is precisely the wrong direction, and the thing that got us into this mess in the first place.
  • How is waiting years before a precomposed accented character is added and usable on updated devices a good approach?
  • How would you redo Unicode?

    a) do basic research and systematics about notation systems first
    b) define usable standards with regard to font technology – not only for combined characters, but also for variant characters and ligatures
    c) re-order code blocks
    d) straighten out terminology
    e) fix glyph bugs and annotation errors

    since all this will never happen, f):

    paint a picture in oil of a flat landscape at sunset (purple sky), with a timber barn on the left side (with open door), a white unicorn with golden hair on the right side, and a black horse with white figures painted on it in the middle.

  • Dave Crossland
    Dave Crossland Posts: 1,452
    I think the strongest case for this I've seen is DecoType's many presentations at Unicode Conferences. The Unicode model of Arabic is bad.
  • John Hudson
    John Hudson Posts: 3,372
    edited 12:04AM
    The Unicode model of Arabic is bad.
    Is it though? The fact that DecoType have been able to achieve everything that they have in the display of Arabic script on top of a Unicode text encoding that also enables entirely different font and shaping technologies suggests that the encoding model is pretty robust. Aspects of it don’t make a lot of sense from an historical script grammar perspective, but a plain text encoding and interchange standard doesn’t need to correspond to anyone’s understanding of how a writing system works (a point that needs to be made repeatedly when e.g. Tamil users look under the hood at the general Indic model in Unicode).

    There are complicated aspects of Arabic joining behaviours and language-specific forms that Unicode decided to solve in a particular way. I am not sure that other possible models improve on those solutions, because they all will shift the complexities somewhere else in the stack. If you want to see a worst-case scenario of what that can look like, consider the Mongolian encoding model, which would be greatly simplified for font developers by moving to something like the Arabic model.
  • John Savard
    John Savard Posts: 1,165
    edited 6:29AM
    Doing the complex stuff by default makes things better for minority languages. Trying to turn the processing of minority languages into the same process used for majority languages is precisely the wrong direction, and the thing that got us into this mess in the first place.

    Why do I disagree with this?
    Well, for one thing, while today computers are a lot more powerful, as well as easier to use, than they were many years ago, the Macintosh and computers running Windows are much harder to program than old-fashioned computers with command-line interfaces instead of a GUI.
    There are techniques to bridge the gap and make it relatively simple to write programs that run in a windowed interface with menus and dialog boxes. But in general, most programming environments require an event-driven model, complicated class libraries, or both.
    I want programming to be straightforward and simple. If processing text is really complicated to program, then writing programs becomes the monopoly of large companies.
    I think the strongest case for this I've seen is DecoType's many presentations at Unicode Conferences. The Unicode model of Arabic is bad.

    While I'm not going to defend the Unicode model of Arabic, I will note that here what's happening is very different from the case I've discussed.
    Here, Unicode text in Arabic is simply a sequence of Arabic letters. The font needs to contain glyphs, with associated logic, that do all the work of turning a sequence of Arabic letters into the correct forms that give Arabic text its proper appearance.
    So the lives of programmers are made easy. It's the font designer who feels the pain.
    Here, I don't have a better idea. There are different ways of writing Arabic. The most common way (Naskh) can be, sort of, handled by associating four (or even two) glyphs with each letter - but that doesn't actually do justice even to that writing style. For example, the Arabic letters for J and K should shift the baseline downwards in proper Naskh handwriting; I've seen a description of an early attempt to reduce Naskh to metal type that preserved this, but in general that feature was ignored to make printing simpler.
    Another form (Nastaliq) of the Arabic script - even though the countries most commonly using it are Farsi-speaking or Urdu-speaking - has a different set of rules.
    So if Arabic was encoded as glyphs - which glyphs? I suppose one could have glyph codes for all the most common written forms of Arabic, so that text would be associated with a specific written form, and would have to be converted. But there would be those who would, quite justifiably, call that a nightmare.
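    As an aside, something close to a glyph encoding of Arabic does already exist in Unicode, but only as a legacy compatibility block: the Arabic Presentation Forms, kept for round-tripping older encodings, where each positional form simply decomposes back to the abstract letter. A minimal sketch, assuming Python's standard unicodedata module:

        import unicodedata

        beh = "\u0628"  # ARABIC LETTER BEH, the abstract letter actually stored in text
        presentation_forms = {
            "isolated": "\uFE8F",
            "final":    "\uFE90",
            "initial":  "\uFE91",
            "medial":   "\uFE92",
        }
        for name, ch in presentation_forms.items():
            # Each presentation form carries a compatibility decomposition back to U+0628.
            print(name, unicodedata.decomposition(ch), unicodedata.normalize("NFKC", ch) == beh)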
  • John Hudson
    John Hudson Posts: 3,372
    So if Arabic was encoded as glyphs - which glyphs? I suppose one could have glyph codes for all the most common written forms of Arabic, so that text would be associated with a specific written form, and would have to be converted. But there would be those who would, quite justifiably, call that a nightmare.
    And one that no one is proposing.
  • John Savard
    John Savard Posts: 1,165
    edited 4:44PM

    So if Arabic was encoded as glyphs - which glyphs? I suppose one could have glyph codes for all the most common written forms of Arabic, so that text would be associated with a specific written form, and would have to be converted. But there would be those who would, quite justifiably, call that a nightmare.
    And one that no one is proposing.

    No, of course not. I used that possibility as an example to explain why encoding the Arabic script as glyphs is absurd.
    Before I knew about Nastaliq, I might well have thought that encoding the Arabic script as the glyphs for Naskh would be appropriate, because that would allow printers to be cheaper. If there were only one set of glyphs for Arabic, then using glyph coding would make Arabic behave like English, which would be the simplest case. (But as I also noted, the typical way of printing Naskh debases even that script, which at one time I also did not know.)
    On the other hand, using precomposed accented characters "works" for French or Italian, and so on. So if people in Burma or Vietnam want it so badly that they've made their own computer codes which came too late to be recognized by Unicode - I'm inclined to root for them, even if I'm open to the possibility that the Unicode way is really the right way.
    And, after all, Burmese is a script like Devanagari or Tibetan, and combining vowels leads to way more combinations than just accent marks.
    I mean, there has to be a reason why nobody is proposing to have precomposed characters for Arabic vowel points, or, worse yet, Hebrew vowel points.
    And then there's Korean.
    I didn't know that the precomposed Korean syllables in Unicode include all possible combinations of the modern jamo (letters), not just the ones actually used in the language; even I think that is insanity. Particularly since other languages written in the Korean script use obsolete jamo for additional sounds, and those combinations aren't included, so encoding every possible combination fails to serve its one possible rational purpose.
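    A minimal sketch of the arithmetic, assuming Python's standard unicodedata module: the precomposed Hangul syllable block is a pure formula over 19 initial × 21 vowel × 28 final positions (including "no final"), 11,172 codepoints in all, and each syllable decomposes back to its jamo.

        import unicodedata

        L_COUNT, V_COUNT, T_COUNT = 19, 21, 28       # modern initial consonants, vowels, finals (incl. none)
        print(L_COUNT * V_COUNT * T_COUNT)           # 11172 precomposed syllables, used in real words or not

        han = "\uD55C"                               # 한 = HIEUH + A + NIEUN
        print([hex(ord(c)) for c in unicodedata.normalize("NFD", han)])   # ['0x1112', '0x1161', '0x11ab']

        # The precomposed codepoint is computed from the jamo indices, not assigned one by one.
        l_index, v_index, t_index = 18, 0, 4         # HIEUH, A, NIEUN in the standard jamo tables
        print(hex(0xAC00 + (l_index * V_COUNT + v_index) * T_COUNT + t_index))   # 0xd55c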
    A modern computer keyboard for Korean, of course, includes each jamo exactly once. Mechanical typewriters for Korean could include two or three copies of some jamo to place them in different positions; a common Korean typewriter might have a "3-set" keyboard, producing only a crude result, while fancier ones would have a "5-set" keyboard, with two versions of most vowels and three versions of many consonants, to approach the quality of typesetting.
    Since dead keys or backspacing are still required, and the 5-set typewriter only approaches the quality of typesetting, I don't think a set of glyphs based on the 5-set typewriter would be a workable compromise between precomposed syllables on the one hand and just encoding the jamo on the other, but, again, at one time I might have been willing to suggest such a thing.
  • John Hudson
    John Hudson Posts: 3,372
    So if people in Burma ... want it so badly that they've made their own computer codes which came too late to be recognized by Unicode
    The Zawgyi encoding for Burmese had nothing to do with being too late to be recognized by Unicode, or even with any particular technical benefit. It was a hack developed during the period of international sanctions against the (previous) military dictatorship, when the country was effectively cut off from computers and software developed in the West and, at the same time, software makers in the West had little financial impetus to support Myanmar language text. Notably, the Zawgyi hack uses codepoints from the Unicode Myanmar code block, so a) its makers were familiar with Unicode and b) chose the worst possible option in terms of being both incompatible and confusable. Yes, it uses a simplified encoding-to-display model in which some shaping forms are assigned character codepoints, but the result is a very minimally functional representation of the script for one particular language.
  • Thomas Phinney
    Thomas Phinney Posts: 2,978
    ... which is not the only language supported by that script.