How would you redo Unicode?

Tofu Type Foundry
I’m not really sure how to word this question. If you could “redo” Unicode right now with no issues regarding compatibility in legacy software, what changes would you make? It seems that some of the decisions made decades ago have led to unexpected issues with modern digital typesetting. I’m curious what Unicode would look like in an “optimized” release with all writing systems and languages accounted for from the start.
Comments
-
For Latin: Abolish quotesingle and quotedbl.
That might prompt keyboard manufacturers to provide separate keys for all four “curly quotes.”
I doubt that having separate code points for quoteright and apostrophe would solve more problems than it would create.
-
Completely eliminate all decomposable diacritic characters and enforce use of letter + combining mark sequences in all languages (a brief sketch of what such sequences look like in practice follows after this list).
Fix Hebrew canonical combining class assignments.
Consistently assign script properties based on the context in which characters are used, not on the script from which they historically derive (looking at you, not-Greek ᶿ).
Avoid unification or compatibility decompositions for letter/symbol lookalikes, so e.g. separate codepoints for hooked-f and florin sign, and for lowercase mu and micro sign.
Provide recommendations for encoding choices for similar and confusable characters, especially for digitally disadvantaged languages.
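To make a couple of the terms in that list concrete, here is a minimal Python sketch using only the standard unicodedata module: a letter + combining mark sequence, the canonical decomposition of an existing precomposed character, and the compatibility decomposition that folds the micro sign into Greek mu.

```python
import unicodedata

# A letter + combining mark sequence: base letter followed by a combining mark.
h_cedilla = "H\u0327"                      # H + COMBINING CEDILLA
for ch in h_cedilla:
    print(f"U+{ord(ch):04X}  {unicodedata.name(ch)}")

# An existing precomposed character canonically decomposes to such a sequence:
decomposed = unicodedata.normalize("NFD", "\u00E7")      # ç
print([f"U+{ord(ch):04X}" for ch in decomposed])         # ['U+0063', 'U+0327']

# A letter/symbol lookalike that was given a compatibility decomposition:
micro, mu = "\u00B5", "\u03BC"             # MICRO SIGN vs GREEK SMALL LETTER MU
print(unicodedata.normalize("NFKC", micro) == mu)        # True: NFKC folds µ into μ
```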
-
John Hudson said: Completely eliminate all decomposable diacritic characters and enforce use of letter + combining mark sequences in all languages.
...we would need to define where to place the cedilla in letters like H, K, N, R or turned V.
-
...we would need to define where to place the cedilla in letters like H, K, N, R or turned V.
That’s already the case, independent of how such things are encoded. Personally, I am fine with floating the cedilla under the middle of these letters, in the absence of any attested forms in actual use.
-
John Hudson said: Completely eliminate all decomposable diacritic characters and enforce use of letter + combining mark sequences in all languages.
And, of course, if I were re-doing Unicode, I would do exactly the opposite. I would provide the less popular languages - the languages of countries which entered the computer age later, the languages of countries that are less economically powerful - with a full set of pre-composed characters, as are often found in the unofficial, unrecognized encodings people have used in those countries, if they are desired.
Why? Because pre-composed characters make it simpler to process text in those languages. Less processing power, less complicated algorithms, less sophisticated programs are required.
It seems like a natural consequence of the current desire to provide all peoples with full equality. But I have to admit, this isn't a no-brainer. The only reason a program that handles a given language can be simpler due to the availability of pre-composed characters is if it only handles the pre-composed versions of those characters. Otherwise, having two alternatives that both need to be handled just makes things more complicated. And that means those simpler programs won't work properly: they won't be compatible with more sophisticated programs that do handle combining mark sequences, which presumably would also exist and would likely be running on the same computers.
So I do admit that what I would prefer is seriously flawed. Thus, perhaps what I would really want to see is for Unicode to be succeeded by two codes - one done the way John Hudson advocates, one done the way I propose - each designed to serve a different purpose. His successor to Unicode would serve as a logical standard for worldwide communications. Mine would serve as, or be closely related to, a computer code or codes well suited to simple and straightforward computation in each particular language.
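A few lines of Python (standard library only) make the trade-off concrete: a program that only ever sees pre-composed characters can treat one code point as one character, but once both encodings are in circulation, even a naive equality test fails unless the program also normalizes.

```python
import unicodedata

precomposed = "\u00E9"      # é as one code point: LATIN SMALL LETTER E WITH ACUTE
decomposed = "e\u0301"      # e followed by COMBINING ACUTE ACCENT

print(len(precomposed), len(decomposed))   # 1 2 - same text, different lengths
print(precomposed == decomposed)           # False without normalization

# A program that accepts both forms has to normalize before comparing:
print(precomposed == unicodedata.normalize("NFC", decomposed))   # True
```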
-
The one thing that is 100% for sure worse than John Savard’s proposal is his additional proposal to have two encoding standards.
Good grief, please, no. That way lies madness.
-
Thomas Phinney said: The one thing that is 100% for sure worse than John Savard’s proposal is his additional proposal to have two encoding standards. Good grief, please, no. That way lies madness.
Sadly, we've already passed this point. The world of standards is already in the grip of that sort of madness.
Of course, though, my goals can be achieved without having two standards. Add in all the desired precomposed characters for those who need them... but deprecate both them and the existing ones, to point modern systems in the better direction.
-
John Savard said: Because pre-composed characters make it simpler to process text in those languages.
Doing the complex stuff by default makes things better for minority languages. Trying to turn the processing of minority languages into the same process used for majority languages is precisely the wrong direction, and the thing that got us into this mess in the first place.
-
How is waiting years before a precomposed accented character is added and usable on updated devices a good approach?
-
How would you redo Unicode?
a) do basic research and systematics about notation systems first
b) define usable standards with regard to font technology – not only for combined characters, but also for variant characters and ligatures
c) re-order code blocks
d) straighten out terminology
e) edit glyph bugs and annotation faults
Since all this will never happen, f): paint a picture in oil with a flat landscape at sunset (purple sky), a timber barn on the left side (with open door), a white unicorn with golden hair on the right side, and a black horse with white figures painted on it in the middle.
-
I think the strongest case for this I've seen is DecoType's many presentations at Unicode Conferences. The Unicode model of Arabic is bad.
-
The Unicode model of Arabic is bad.
Is it though? The fact that DecoType have been able to achieve everything that they have in the display of Arabic script on top of a Unicode text encoding that also enables entirely different font and shaping technologies suggests that the encoding model is pretty robust. Aspects of it don’t make a lot of sense from an historical script grammar perspective, but a plain text encoding and interchange standard doesn’t need to correspond to anyone’s understanding of how a writing system works (a point that needs to be made repeatedly when e.g. Tamil users look under the hood at the general Indic model in Unicode).
There are complicated aspects of Arabic joining behaviours and language-specific forms that Unicode decided to solve in a particular way. I am not sure that other possible models improve on those solutions, because they all will shift the complexities somewhere else in the stack. If you want to see a worst-case scenario of what that can look like, consider the Mongolian encoding model, which would be greatly simplified for font developers by moving to something like the Arabic model.
-
Simon Cozens said: Doing the complex stuff by default makes things better for minority languages. Trying to turn the processing of minority languages into the same process used for majority languages is precisely the wrong direction, and the thing that got us into this mess in the first place.
Why do I disagree with this? Well, for one thing, while today computers are a lot more powerful, as well as easier to use, than they were many years ago, the Macintosh and computers running Windows are much harder to program than old-fashioned computers with command-line interfaces instead of a GUI. There are techniques to bridge the gap and make it possible to write programs that run on a windowed interface with menus and dialog boxes simply, but in general most programming environments require an event-driven model or complicated class libraries or both. I want programming to be straightforward and simple. If processing text is really complicated to program, then writing programs becomes the monopoly of large companies.
Dave Crossland said: I think the strongest case for this I've seen is DecoType's many presentations at Unicode Conferences. The Unicode model of Arabic is bad.
While I'm not going to defend the Unicode model of Arabic, I will note that what's happening here is very different from the case I've discussed. Here, Unicode text in Arabic is simply a sequence of Arabic letters. The font needs to contain glyphs, with associated logic, that do all the work of turning a sequence of Arabic letters into the correct glyphs that will give Arabic text its correct appearance. So the lives of programmers are made easy; it's the font designer who feels the pain.
Here, I don't have a better idea. There are different ways of writing Arabic. The most common style (Naskh) can be, sort of, handled by associating four (or even two) glyphs with each letter - but that doesn't actually do justice even to that writing style. For example, the Arabic letters for J and K should shift the baseline downwards in proper Naskh handwriting; I've seen a description of an early attempt to reduce Naskh to metal type that preserved this, but in general that feature was ignored to make printing simpler. Another form of the Arabic script, Nastaliq - even though the countries most commonly using it are Farsi-speaking or Urdu-speaking - has a different set of rules.
So if Arabic was encoded as glyphs - which glyphs? I suppose one could have glyph codes for all the most common written forms of Arabic, so that text would be associated with a specific written form and would have to be converted. But there would be those who would, quite justifiably, call that a nightmare.
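To make the "sequence of Arabic letters" point concrete, here is a minimal Python sketch (standard unicodedata module only): the stored text consists of abstract letters with no positional information, while the legacy presentation-form code points that do encode specific positional shapes simply fold back to those letters under compatibility normalization - shaping is left to the font and layout engine.

```python
import unicodedata

word = "\u0628\u0627\u0628"          # the Arabic letters BEH, ALEF, BEH ("باب")
for ch in word:
    print(f"U+{ord(ch):04X}  {unicodedata.name(ch)}")
# The stored text says nothing about initial/medial/final shapes;
# selecting and joining the right glyphs happens in the font at display time.

# Legacy presentation-form code points (one per positional shape) still exist,
# but they carry compatibility decompositions back to the abstract letter:
beh_initial = "\uFE91"               # ARABIC LETTER BEH INITIAL FORM
print(unicodedata.normalize("NFKC", beh_initial) == "\u0628")   # True
```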
-
So if Arabic was encoded as glyphs - which glyphs? I suppose one could have glyph codes for all the most common written forms of Arabic, so that text would be associated with a specific written form, and would have to be converted. But there would be those who would, quite justifiably, call that a nightmare.
And one that no one is proposing.
-
John Hudson said: So if Arabic was encoded as glyphs - which glyphs? ... But there would be those who would, quite justifiably, call that a nightmare. And one that no one is proposing.
No, of course not. I used that possibility as an example to explain why encoding the Arabic script as glyphs is absurd. Before I knew about Nastaliq, I might well have thought that encoding the Arabic script as the glyphs for Naskh would be appropriate, because that would allow printers to be cheaper. If there were only one set of glyphs for Arabic, then using glyph coding would make Arabic behave like English, which would be the simplest case. (But as I also noted, the typical way of printing Naskh debases even that script, which at one time I also did not know.)
On the other hand, using precomposed accented characters "works" for French or Italian, and so on. So if people in Burma or Vietnam want it so badly that they've made their own computer codes, which came too late to be recognized by Unicode, I'm inclined to root for them, even if I'm open to the possibility that the Unicode way is really the right way. And, after all, Burmese is a script like Devanagari or Tibetan, and combining vowels lead to way more combinations than just accent marks. I mean, there has to be a reason why nobody is proposing to have precomposed characters for Arabic vowel points, or, worse yet, Hebrew vowel points.
And then there's Korean. I didn't know that the precomposed Korean syllables in Unicode included all possible combinations of jamo (letters), not just the ones actually used in the language; even I think that is insanity. Particularly as other languages using the Korean script use obsolete jamo for additional sounds, and combinations involving those jamo aren't included, so the set of all possible combinations fails to serve its one possible rational purpose.
A modern computer keyboard for Korean, of course, includes each jamo exactly once. Mechanical typewriters for Korean could include two or three copies of some jamo to place them in different positions; a common Korean typewriter might have a "3-set" keyboard, producing only a crude result, while fancier ones would have a "5-set" keyboard, with two versions of most vowels and three versions of many consonants, to approach the quality of typesetting. Since dead keys or backspacing are still required, and the 5-set typewriter only approaches the quality of typesetting, I don't think a set of glyphs based on the 5-set typewriter would be a workable compromise between precomposed syllables on the one hand and just encoding the jamo on the other - but, again, at one time I might have been willing to suggest such a thing.
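For context on the size of that precomposed set, the Hangul Syllables block is generated algorithmically from the modern jamo, as in this minimal Python sketch of the composition formula given in the Unicode Standard:

```python
# Precomposed Hangul syllables cover every combination of the modern jamo:
# 19 leading consonants x 21 vowels x 28 trailing consonants (including "none"),
# starting at U+AC00.
L_COUNT, V_COUNT, T_COUNT = 19, 21, 28
S_BASE = 0xAC00

print(L_COUNT * V_COUNT * T_COUNT)          # 11172 syllables, used or not

def compose(l_index: int, v_index: int, t_index: int = 0) -> str:
    """Map modern-jamo indices to the precomposed syllable code point."""
    return chr(S_BASE + (l_index * V_COUNT + v_index) * T_COUNT + t_index)

print(compose(0, 0, 0))     # '가' (U+AC00): first leading consonant + first vowel
print(compose(18, 0, 4))    # '한' (U+D55C): hieuh + a + trailing nieun
```

Syllables involving obsolete jamo fall outside this block and have to be written with conjoining jamo code points instead.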
-
So if people in Burma ... want it so badly that they've made their own computer codes which came too late to be recognized by Unicode
The Zawgyi encoding for Burmese had nothing to do with being too late to be recognized by Unicode or even any particular technical benefit. It was a hack developed during the period of international sanctions against the (previous) military dictatorship, when the country was effectively cut off from computers and software developed in the West and, at the same time, software makers in the West had little financial impetus to support Myanmar language text. Notably, the Zawgyi hack uses codepoints from the Unicode Myanmar code block, so a) its makers were familiar with Unicode and b) chose the worst possible option in terms of being both incompatible and confusable. Yes, it uses a simplified encoding-to-display model in which some shaping forms are assigned character codepoints, but the result is a very minimally functional representation of the script for one particular language.
-
... which is not the only language supported by that script.
-
John Hudson said: The Zawgyi encoding for Burmese had nothing to do with being too late to be recognized by Unicode or even any particular technical benefit. It was a hack developed during the period of international sanctions against the (previous) military dictatorship ... the result is a very minimally functional representation of the script for one particular language.
Isn't Burma/Myanmar still under a military dictatorship? (Of course, that is "once again", so there indeed was a previous one, so there was no mistake in your posting.)
Although I am strongly opposed to the suppression of minority ethnic groups in any country, I don't view this issue as directly germane to the question of using the Zawgyi encoding as the basis for an encoding of the Burmese script. After all, two things can happen: glyphs supporting other languages using the script can be added, or, after the issues involving such minority groups as the Karen are fully resolved, those minority groups may turn their backs on using the Burmese script for their languages in future. Think of how Mongolia has returned to its native script, and other nationalities have switched from Cyrillic to Latin. So I have no problem with dismissing politics - even politics I'm fully sympathetic to - as a factor in assessing the merits, or otherwise, of this aspect of character encoding.
Of course, there is another argument against precomposed glyphs based on minority languages: that so many languages, some of them quite obscure, use a given script that keeping track of all the additional precomposed glyphs they would require is not practical. I suspect that this is not usually true in practice, but that argument is at least potentially legitimate.
-
Again: the result is a very minimally functional representation of the script.
At one point, I was asked to make a Zawgyi-compatible version of the Myanmar Text font, and the result was so hampered by the limitations of that model that the client decided not to proceed: it just looked too clumsy compared to the Unicode-OpenType version. Simply put, glyph encodings of complex scripts are always inflexible and limited compared to the possibilities of dynamic glyph processing on top of abstracted character encoding.
-
I didn't know that the precomposed Korean syllables in Unicode included all possible combinations of jamo (letters), not just the ones actually used in the language; even I think that is insanity.
It was an entirely political compromise to satisfy the South Korean national standards body at a time when Unicode was not guaranteed to be an accepted standard in East Asia. I don’t think anyone considers it a good thing.
-
John Hudson said: Simply put, glyph encodings of complex scripts are always inflexible and limited compared to the possibilities of dynamic glyph processing on top of abstracted character encoding.
I can't argue with that; it is indeed perfectly possible for a script to be complex enough that an attempt to reduce it to a manageable set of precomposed glyphs results in a grossly inadequate representation of the script. So I admit that this can be a valid argument against that kind of encoding in any given specific case.
I do quibble about "always"; a script may still be describable as "complex" without the number of precomposed glyphs required to represent it adequately involving such a combinatorial explosion as to make an encoding of that nature impossible or even impractical. But just because something isn't an invariable rule doesn't mean it can't be true most of the time. (Yes, quibbling is argumentative, but a quibble does not an effective argument make, so my meaning should be clear enough that I need not resort to claiming to "contain multitudes".)