How would you redo Unicode?

Tofu Type Foundry
Tofu Type Foundry Posts: 50
edited May 6 in Technique and Theory
I’m not really sure how to word this question. If you could “redo” Unicode right now with no issues regarding compatibility in legacy software, what changes would you make? It seems that some of the decisions made decades ago have led to unexpected issues with modern digital typesetting. I’m curious what Unicode would look like in an “optimized” release with all language systems and languages accounted for from the start.
Comments

  • Nick Shinn
    Nick Shinn Posts: 2,265
    For Latin: Abolish quotesingle and quotedbl.
    That might prompt keyboard manufacturers to provide separate keys for all four “curly quotes.” 

    I doubt that having separate code points for quoteright and apostrophe would solve more problems than it would create.
  • John Hudson
    John Hudson Posts: 3,369
    Completely eliminate all decomposable diacritic characters and enforce use of letter + combining mark sequences in all languages.

    Fix Hebrew canonical combining class assignments.

    Consistently assign script properties based on the context in which characters are used, not on the script from which they historically derive (looking at you, not-Greek ᶿ).

    Avoid unification or compatibility decompositions for letter/symbol lookalikes, so e.g. separate codepoints for hooked-f and florin sign, and for lowercase mu and micro sign.

    Provide recommendations for encoding choices for similar and confusable characters, especially for digitally disadvantaged languages.
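To make the precomposed-versus-decomposed duality concrete, here is a minimal Python sketch using only the standard library's unicodedata module. It shows that the two encodings of "é" render identically yet compare unequal as raw code points, and that canonical normalization (NFD/NFC) is what unifies them:

```python
import unicodedata

# U+00E9 (precomposed e-acute) vs U+0065 U+0301 (e + combining acute accent)
precomposed = "\u00e9"
decomposed = "e\u0301"

# The two sequences display the same glyph but are different code point strings.
print(precomposed == decomposed)  # False

# Canonical normalization maps between the two representations:
# NFD fully decomposes, NFC recomposes where a precomposed form exists.
print(unicodedata.normalize("NFD", precomposed) == decomposed)   # True
print(unicodedata.normalize("NFC", decomposed) == precomposed)   # True
```

Eliminating decomposable characters, as proposed above, would collapse this duality to a single representation and make normalization largely unnecessary for comparison.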
  • Igor Freiberger
    Igor Freiberger Posts: 288
    Completely eliminate all decomposable diacritic characters and enforce use of letter + combining mark sequences in all languages.
    Yes, but we would need to define where to place the cedilla on letters like H, K, N, R, or turned V. So far I have found no reliable information about this. Did you learn anything new in recent years?
  • John Hudson
    John Hudson Posts: 3,369
    ...we would need to define where to place the cedilla in letters like H, K, N, R or turned V.
    That’s already the case, independent of how such things are encoded. Personally, I am fine with floating the cedilla under the middle of these letters, in the absence of any attested forms in actual use.
  • John Savard
    John Savard Posts: 1,163
    edited May 6
    Completely eliminate all decomposable diacritic characters and enforce use of letter + combining mark sequences in all languages.
    And, of course, if I were redoing Unicode, I would do exactly the opposite. I would provide the less popular languages (the languages of countries that entered the computer age later, or that are less economically powerful) with a full set of pre-composed characters, as are often found in the unofficial, unrecognized encodings people have used in those countries, where they are desired.
    Why?
    Because pre-composed characters make it simpler to process text in those languages: less processing power, simpler algorithms, and less sophisticated programs are required.
    But I have to admit, this isn't a no-brainer. It seems like a natural consequence of the current desire to provide all peoples with full equality.
    But the only reason a program that handles a given language can be simpler, thanks to the availability of pre-composed characters, is if it handles only the pre-composed versions of those characters. Otherwise, having two alternatives that both need to be handled just makes things more complicated. And that means those programs won't work properly: they won't be compatible with more sophisticated programs that do handle combining mark sequences correctly, which presumably would also exist and would likely be running on the same computers.
    So I do admit that what I would prefer is seriously flawed.
    Thus, perhaps what I would really want to see is instead for Unicode to be succeeded by two codes - one done the way John Hudson advocates, one done the way I propose, each of these codes being designed to serve a different purpose.
    His successor to Unicode would serve the purpose of being a logical standard for worldwide communications.
    My successor to Unicode would serve the purpose of being (or being closely related to) a computer code or codes well suited to simple and straightforward computation in each particular language.
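The compatibility problem described above can be illustrated with a short Python sketch. The accent-stripping table and function names are hypothetical, invented for this example; the point is that a program keyed only to pre-composed characters silently mishandles the decomposed spelling of the same text unless it normalizes first:

```python
import unicodedata

# Hypothetical accent-stripping table, keyed only by precomposed characters.
STRIP = {"\u00e9": "e", "\u00f1": "n"}

def strip_accents_naive(text):
    # Works only when the input happens to use precomposed characters.
    return "".join(STRIP.get(ch, ch) for ch in text)

def strip_accents_robust(text):
    # Normalizing to NFC first folds combining sequences into precomposed
    # forms where they exist, so both spellings are handled.
    return strip_accents_naive(unicodedata.normalize("NFC", text))

print(strip_accents_naive("caf\u00e9"))    # "cafe"
print(strip_accents_naive("cafe\u0301"))   # "cafe\u0301" -- combining mark slips through
print(strip_accents_robust("cafe\u0301"))  # "cafe"
```

This is the sense in which having two encodings of the same character makes every program either more complicated or subtly broken.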
  • Thomas Phinney
    Thomas Phinney Posts: 2,977
    The one thing that is 100% for sure worse than John Savard’s proposal is his additional proposal to have two encoding standards.

    Good grief, please, no. That way lies madness.
  • John Savard
    John Savard Posts: 1,163
    edited 2:15AM
    The one thing that is 100% for sure worse than John Savard’s proposal is his additional proposal to have two encoding standards.

    Good grief, please, no. That way lies madness.
    Sadly, we've already passed this point.
    The world of standards is already in the grip of that sort of madness.
    Of course, though, my goals can be achieved without having two standards. Add in all the desired precomposed characters for those who need them... but deprecate both them and the existing ones to point modern systems in the better direction.


  • Simon Cozens
    Simon Cozens Posts: 772
    Because pre-composed characters make it simpler to process text in those languages.
    Well, this just isn't true. But even if it were true, think of it the other way around: If even the majority languages had to deal with decomposed characters, software implementers would get them right.

    Doing the complex stuff by default makes things better for minority languages. Trying to turn the processing of minority languages into the same process used for majority languages is precisely the wrong direction, and the thing that got us into this mess in the first place.
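One concrete example of the "complex stuff" that implementers get right when forced to: when a letter carries multiple combining marks, canonical combining classes define a normalized mark order, so differently typed sequences still compare equal after normalization. A small Python sketch using the standard library:

```python
import unicodedata

# Combining ring below (U+0325, combining class 220) and combining acute
# (U+0301, combining class 230), attached to "e" in two different orders.
a = "e\u0301\u0325"  # acute typed first
b = "e\u0325\u0301"  # ring below typed first

# Raw code point comparison sees two different strings...
print(a == b)  # False

# ...but canonical normalization sorts the marks by combining class,
# so both spellings normalize to the same sequence.
print(unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b))  # True
```

Software that handles this machinery correctly for majority languages handles minority-language mark sequences essentially for free; this also connects to the Hebrew combining-class fixes mentioned earlier in the thread.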
  • How is waiting years for a precomposed accented character to be added, and to become usable on updated devices, a good approach?
  • How would you redo Unicode?

    a) do basic research and systematics about notation systems first
    b) define usable standards with regard to font technology – not only for combined characters, but also for variant characters and ligatures
    c) re-order code blocks
    d) straighten out terminology
    e) fix glyph bugs and annotation faults

    since all this will never happen, f):

    paint a picture in oil of a flat landscape at sunset (purple sky), with a timber barn on the left side (with open door), a white unicorn with golden hair on the right side, and a black horse with white figures painted on it in the middle.