Digital Greek Typography is Broken

Thomas Phinney · February 20

… says this petition:

https://www.openpetition.eu/gr/petition/online/digital-greek-typography-is-broken-improve-standards-and-demand-fixes-in-all-software

I certainly buy that there are many problems, but some things seem to be getting blamed on Unicode that are actually to do with downstream implementations.

John Hudson · February 20

From that openpetition page:

1. Prioritize a comprehensive review and revision of Unicode definitions and normalization rules for Greek, in consultation with native Greek speakers and experts in Greek typography.

Very few of the software problems cited in the petition are due to Unicode normalisation per se, and Unicode normalisation cannot be changed due to stability agreements between standards bodies, so this seems both a distraction and a non-starter.

mitradranirban · February 20

In 2002 we had similar problem with Bangla digital typography regarding Khanda Ta, a Yaphalaa . Initially unicode people told the specifications is alright but the problem is in implementation. Later when difficulties in implementing existing specifications, unicode consortium relented by changing the specifications and encoding khanda ta as a separate character. Luckily there was not much backward compatibility issues then. But for Greek any change in specifications now will have big compatibility issues.

Thomas Phinney · February 20

With regard to encoding, Unicode can ADD new characters. They can even deprecate existing characters… but never remove them from the specification altogether.

But most of the Greek issues with encoding seem to be about apps (or even fonts) doing things incorrectly, when enough correct and distinct codepoints already exist.

Denis Moyogo Jacquerye · February 20

The ATD3 presentation is online: https://vimeo.com/1058500542#t=63:12

Andreas Stötzner · February 20

this initiative may be helpful in increasing awareness of those issues, but I doubt that this petition alone will have much effect to improvements. Foremost, I recommend to precisely define which problem arises in which scenario; and, most important, to distinguish a) encoding issues; from b) application issues, c) font bugs and d) keyboard issues. In the text of the petition, these four (!) aspects are too much mixed into each other. It won’t help to blame e.g. Unicode for a bug, lets say, in a specific font or in the source code of a language setting (–› hyphenation rules).

As far as I can tell there are no (or no grave) bugs in Unicode for Greek. If someone thinks there are, file a proposal to be adressed to UTC directly. However, as Thomas, mentioned, stability policy demands that existing encodings will be kept unaltered.

Only as a side-aspect: the so-far unencoded Greek Omikron-Upsilon character (rather: glyph) may get encoded in the near future.

Kent Lew · February 20

I saw the presentation in the livestream, but want to watch it again sometime as the slides went by very quickly, and some of the cited problems deserve closer examination.

I am sympathetic with the frustrations and support advocacy for solutions; but I also agree that it seems ire is misdirected at Unicode. The use of the term “normalization” in a broad array of issues further confuses the matter since it has a specific meaning in the context of Unicode, which is not always related to the observed problems, that I can tell (and may not align with what is actually being complained about).

In line with Andreas’s comment, the technical locus of each issue needs to be further pinpointed in order to direct appeals and apply pressure where it is most likely to yield real, practical results.

One issue that may be somewhat addressable by font makers is the problem of mixed fallbacks, where a π or µ from one [usually non-Greek] font is interspersed with true Greek setting from .

And in this instance, Unicode may be somewhat culpable. I think the handling of Omega/Ohm, Delta/increment, mu/micro, and especially lack of equivalent non-Greek pairing for pi has played a role.

The problem seems to stem from the fact that most fonts include Ω∆πµ for scientific & non-Greek application and not designed in full Greek context, and yet these are encoded as Greek rather than their non-Greek equivalents. The reason for this, I believe, has much to do with legacy codepages and also with keyboard implementations, but also seems to trace back to Unicode determinations of equivalence.

Font makers might do some small service by not including these four in fonts that do not otherwise support Greek (at least for those destined for web use or other environments that implement fallback stacks). That way, Greek text might fall back to some consistent representation.

This would mean that any non-Greek reference would also have to fall back, which might not be pretty. The glyphs could still be included with their non-Greek codepoints, but most non-Greek keyboards have unfortunately chosen to input the Greek codepoint (due to Unicode equivalence, I imagine), which is part of the problem and perhaps also a legitimate issue.

John Hudson · February 20

The Unicode encoding of Greek is messy and certainly not as simple or easy to implement as it should have been. As Kent notes, it includes some equivalences between Greek letters and Greek-derived symbols that have been a problem because of choices made in implementations, notably in 8-bit codepages and the long tail of inherited dependencies. I suspect any ‘review and revision of Unicode definitions and normalization’ and establishment of ‘a standardized character set for digital Greek that supports all necessary characters, diacritics, and typographic conventions’ is going to result simply in a list of ‘Use these characters’ and ’Don’t use these characters’.

The petition mentions ‘incorrect case conversion’. Greek casing is complicated by factors resulting from modern Greek typographic practice: some precomposed diacritic characters exist only as lowercase because they do not occur in word-initial position and in all-caps would, conventionally, lose their diacritic marks. These encodings require one-to-many, precomposed-to-base+mark conversions. Some aspects of casing—notably contextual behaviour of mark suppression in all-caps—are put onto font makers to handle at the glyph substitution level, but I think that is necessary because the suppression of marks in all-caps settings is a modern convention of Greek typography and not a consistent practice of the script. It would break correct display of many centuries of Greek text if that aspect of casing were applied at the character level in software case conversion implementations.

I would say that Greek, as encoded in Unicode and needing to be supported at the glyph substitution level to affect correct display, meets the definition of a ‘complex script’. I suspect some of the frustration arises from the assumption—not only on the part of users but also on the part of implementing software makers—that Greek must be a simple script because it is European and alphabetical, rather than a Middle Eastern cursive abjad or Indic alphasyllabery.

Thomas Phinney · February 20

For the symbols vs Greek problem, seems like one reasonable solution would be:

- fonts with full Greek support put the math/tech symbol characters at the correct codepoints (already happening)
- fonts with the symbols but not the corresponding Greek put each symbol at BOTH the symbol codepoint and the Greek codepoint (a revision required), so it works in legacy environments as well as future environments
- Most importantly, future apps/environments use the correct codepoints for the symbols, and fall back to the Greek codepoint only if the symbol codepoints are missing in a given font

One could add an optional flag to the OpenType format, that if “on” would indicate that “hey I use the proper symbols codepoints for those Greek-like symbols”; but that would only work in a future app/environment that looked for it, in which case it could do fallback.

John Hudson · February 20

Thomas, some of the Greek-derived symbol characters have compatibility decompositions to Greek letter characters, i.e. one-directional decompositions that are not standard normalizations but may be applied. These reflect compatibilities in old 8-bit character sets, where e.g. Greek μ and the micro symbol used the same decimal encoding. These compatibility decompositions have long tail dependencies in software, and since they’re not wrong from a Unicode standardization perspective, there isn’t any impetus for software makers to track down those inherited dependencies and change them. I think compatibility decompositions—unlike canonical decompositions—may not be subject to stability agreements, so could perhaps be changed. Having a clean encoding distinction between Greek letters and Greek-derived symbols would be helpful; of course, it doesn’t guarantee that anyone is going to go and clean up existing code bases.

John Hudson · February 20

The ano teleia normalization problem is a bad one. U+0387 GREEK ANO TELEIA has a canonical decomposition to U+00B7 MIDDLE DOT, which is entirely inappropriate because it conventionally sits too low to be used as an ano teleia. Because this is a canonical decomposition, it cannot be changed in Unicode. So there is always a chance that U+0387 is going to be converted to U+00B7. The issue can be addressed by a grek script locl substitution

sub periodcentered by anoteleia;

(followed by case and smcp substutions of appropriate height variants for all-caps and smallcap ano teleia), but that means an actual middle dot is unavailable for Greek text, which might be an issue for e.g. transcribing coin or seal inscriptons.

John Hudson · February 20

The Greek question mark is an interesting case. This also has a canonical decomposition, from U+037E GREEK QUESTION MARK to U+003B SEMICOLON.

The ATD3 presentation and the petition both suggest that this is a problem because it prevents a disinct form being used for the Greek question mark. As with ano teleia, a grek locl glyph can be implemented, and should work so long as the script=common property of the semicolon means it is rolled into the adjacent Greek glyph run for OTL processing. And unlike ano teleia vs middle dot, I think there is no context in which the common semicolon might be used distinctively in Greek text.

But I am also wondering what a distinct form of Greek question mark would look like? When does it not have the same shape as the semicolon?

Kent Lew · February 20

Thomas Phinney said:

- fonts with the symbols but not the corresponding Greek put each symbol at BOTH the symbol codepoint and the Greek codepoint (a revision required), so it works in legacy environments as well as future environments

Thomas, I think you miss my point. A font with the symbol in both codepoints but not the rest of the corresponding Greek is what I believe is causing the problems of mixed font display: a specified font doesn’t support Greek so the display falls back through the stack to one that does — but, the specified font does contain Ω∆µπ Greek codepoints and so those originals get mixed in with the rest of the Greek from the fallback and don’t mesh well.

If the font doesn’t support Greek, having only the symbols for non-Greek use would solve that problem. Except that modern non-Greek keyboards only put in the Greek codepoints in non-Greek settings, rather than symbol codepoints, so most texts don’t properly distinguish the usage.

Mathematicians will likely tell you that the symbols for use in math & science are the Greek letters. Thus the Unicode equivalence determinations. But that seems to be exactly what has led to this problem for the Greek users.

Kent Lew · February 20

For the reasons you point out, John, the ire regarding ano teleia and Greek question mark is rightly directed at Unicode, precisely because of those canonical determinations. Faulting “normalization” is warranted in these cases. But as you say, the canonicality probably means that ship has sailed and unlikely to be called back to port.

Nick Shinn · February 20

@John Hudson

I suspect some of the frustration arises from the assumption—not only on the part of users but also on the part of implementing software makers—that Greek must be a simple script because it is European and alphabetical, rather than a Middle Eastern cursive abjad or Indic alphasyllabery.

Perhaps the complexities involved in typesetting Greek for Renaissance polyglot bibles, full of cursivity, could be a benchmark.

John Hudson · February 21

Regarding the incorrect display of some characters with symbol forms or fallback to other fonts: this is something that needs to be examined on a case-by-case basis, because I don’t think there is a single explanation for what is happening in all environments. This sort of thing is a legacy of 8-bit encodings, which were notoriously platform-specific, and then of the process by which those 8-bit encodings were translated—again, independently on different platforms—to ‘codepages’ of Unicode codepoints. Because, to save space, Greek letters and Greek-derived symbols shared decimal codepoints in 8-bit encodings, how they were interpreted in the translation to Unicode codepages was not consistent, and the mix of canonical and compatibility mappings and unifications in Unicode reflect that: some pairs of characters are always considered equivalent, some are sometimes equivalent in some places, and some remain unified on a single codepoint (notably the π letter/symbol).

I think the messiness of this in part reflects Unicode, at a particular point, being more descriptive than prescriptive. Software companies had done things with Greek and were doing things with Greek, and Unicode was trying to capture that and apply some flexible rationale to cover the variety of things done. This is the case for a lot of the scripts encoded early in the history of the standard: none of this mess would pass muster in a script encoding proposal today.

John Savard · February 24

I have just finished watching this YouTube video.

https://www.youtube.com/watch?v=yPicWqEaAwc

Apparently Safari fails to render Unicode-compliant Mongolian-language fonts correctly.

I have always felt that an international standard should treat every language equally. French gets pre-composed characters for all of its accented letters? Then so should Burmese and every other language. One glyph=one codepoint makes the design of printers as simple for other languages as it is for English.

If the Unicode consortium isn't interested, then the countries affected adversely by this should get together, and define something new: it could be called Equicode, for example, and operating systems, browsers, and so on (i.e. Adobe Acrobat) could just switch to the new standard (or at least support it as an alternative, since handling legacy material in Unicode would be needed for quite some time to come).

Simon Cozens · February 24

John Savard said:

I have just finished watching this YouTube video.
https://www.youtube.com/watch?v=yPicWqEaAwc
Apparently Safari fails to render Unicode-compliant Mongolian-language fonts correctly.

Okay. So. Unicode has nothing to do with fonts and rendering. There is not a straightforward correspondence between letters and sounds in English. What she says about Chinese is incorrect because of regional differences between characters. So not a great start. But from there she develops her main point, which appears to be "because we have to do shaping, it's complicated, and therefore wrong". But many things about computers are complicated, nd that does not mean they are wrong. More complex scripts require more complex fonts? What a surprise.

I have always felt that an international standard should treat every language equally. French gets pre-composed characters for all of its accented letters? Then so should Burmese and every other language.

I wonder if there was a good reason why Unicode has done things the way it has? (Hint: there was.) I am not claiming that Unicode is perfect. But by goodness, understand it before dumping all font problems at its door.

Script complexity is like a waterbed - it has to go somewhere; push it down in one area, it simply pops up somewhere else. If Equicode Mongolian has a code point for every form, you all you've done is pushed the complexity onto the IME. Even if you've made "printers" simpler (doubtful), you've made "keyboards" more difficult - not a win.

If the Unicode consortium isn't interested, then the countries affected adversely by this should get together, and define something new.

Include obligatory XKCD cartoon here.

could just switch

In the immortal words of Inigo Montoya, I do not think that means what you think it means.

John Hudson · February 24

I have always felt that an international standard should treat every language equally. French gets pre-composed characters for all of its accented letters? Then so should Burmese and every other language.

That statement assumes that precomposed, atomic diacritic characters are better. They are not. The only reason a bunch of European diacritics ended up encoded that way in Unicode is that the standard guaranteed a one-to-one mapping with pre-existing 8-bit national standards. The more general principle that diacritics should be encoded as sequences of base + combining mark(s) is a better architectural model, and I would love to have a text encoding and layout standard in which there were no precomposed diacritic encodings. We’re now at a point where we’re creating glyphs for these characters in fonts simply to meet the legacy requirements of software from almost fifty years ago, despite having access to much more flexible font and layout technologies. I would love to be able to make smaller, smarter fonts for European languages that could safely presume fully decomposed character string input.

Burmese script shaping is in no way akin to Latin base+mark diacritic handling. The complexities involved are typical of all Brahmi-derived writing systems, and trying to accommodate them via a ‘one glyph=one codepoint’ model takes you down the obsolete Arabic presentation forms route: you simply end up with a huge number of precomposed typeforms that isn’t flexible enough to represent all of the things that can potentially occur in the script, especially when you start writing languages other than Burmese.

As for Mongolian, I have worked on Mongolian fonts and the encoding model in Unicode is deeply flawed—it requires one to try to implement aspects of Mongolian gender grammar at the glyph substitution level—, but my understanding is that it is that way because of the insistence of regional standards bodies. A one-to-one character-to-glyph mapping isn’t needed to resolve the issues: as the video linked also suggests, something like the Arabic encoding model could be applied to Mongolian, in which characters are encoded based on their joining behaviour. I have spoken with people at Unicode who would like to add characters for Mongolian to move it towards the Arabic model, which would remove things like gender contextual GSUB rules from font production. That level of complexity makes sense to me.

Nick Shinn · February 24

@John Hudson

We’re now at a point where we’re creating glyphs for these characters in fonts simply to meet the legacy requirements of software from almost fifty years ago, despite having access to much more flexible font and layout technologies.

Backwards compatibility is necessary, to enable longevity of older hardware and software. Messiness is the price we pay for that. The alternative is familiar—a constant rat race of planned obsolescence, pardon me, progress, and the resultant pile of debris.

John Hudson · February 24

Technically, we are long past the point where we could have normalized all text to Unicode NFD (fully decomposed), and still have been able to support older fonts at a buffered stage during the initial cmap mapping. NFD is already used throughout a wide array of text processing operations such as searching and sorting, so fonts and layout seem peculiarly stuck in a precomposed world. It gets sillier every year, because all font tools now use exactly the same anchor mechanism to build precomposed composites as could build the same typeforms on-the-fly during text rendering. We are literally building a thing and then not using it.

Nick Shinn · February 25

So, why doesn’t font editing software “decompose” glyphs, at least as an option?

John Hudson · February 25

Because text engines treat European languages as if its still the 1990s, so if precomposed diacritic characters are missing they fall back to other fonts or display .notdef, even if the font contains everything it needs to display the canonically decomposed sequence. This is the irony of these languages and character sets originally taking precedence over other scripts: they’re stuck in the past.

Years ago, I proposed a new cmap subtable format for one-to-many character-to-glyph mappings, which would have provided a direct route for decomposition of incoming precomposed characters. What I had in mind would also have been able to handle arbitrary decompositions as defined by the font maker, not just Unicode canonical decompositions. So, for example, one would have been able to make Arabic fonts from small sets of archigrapheme shapes and marks, without having to go through the initial step of mapping from Unicode characters to default representations and then decomposing in GSUB.

Yes, there these sorts of new mechanisms break backwards compatibility, but we can manage that in a transition period as we have with other improvements or extensions to font technologies. The ‘rat race of planned obsolescence’ is really a choice not to push updates to older systems, not an inevitable outcome of introducing new ways of doing things.

Nick Shinn · February 25

One-to-many mapping would also greatly simplify my work with contextual alternates.

John Savard · February 28

John Hudson said:

Burmese script shaping is in no way akin to Latin base+mark diacritic handling. The complexities involved are typical of all Brahmi-derived writing systems, and trying to accommodate them via a ‘one glyph=one codepoint’ model takes you down the obsolete Arabic presentation forms route: you simply end up with a huge number of precomposed typeforms that isn’t flexible enough to represent all of the things that can potentially occur in the script, especially when you start writing languages other than Burmese.

I will admit that where a language represents vowels by something akin to a diacritic mark associated with a consonant, there is a risk of explosive complexity, among other things.

As for writing languages other than Burmese, the lack of composing characters would indeed hamper that in the immediate term; I'm not calling for removing the ability to form characters by composition from Unicode, but for adding precomposed characters to make it possible to avoid requiring that mechanism. Of course characters for any and every language using the Burmese script would be added to Unicode posthaste; it's not as if the Cyrillic character set in Unicode can only be used to write Russian, and not also Belarusian, Ukranian, Bulgarian and Serbian. (This would even be true of minority languages in which national standard bodies had no interest.)

And there was an unofficial local standard for Burmese which was popular because it made it easier to handle the language, so at least in that case this isn't technologically infeasible. Instead, it's handling Burmese the right way which appears to be infeasible, even if it isn't really, and the appearance is only due to laziness. But it's laziness that the ordinary computer user in Burma has no control over.

John Hudson said:

Because text engines treat European languages as if its still the 1990s, so if precomposed diacritic characters are missing they fall back to other fonts or display .notdef, even if the font contains everything it needs to display the canonically decomposed sequence. This is the irony of these languages and character sets originally taking precedence over other scripts: they’re stuck in the past.

Such text engines are indeed noncompliant. I don't disagree that they ought to be fixed.

However, the users of European languages are still lucky; precisely because precomposed characters for those languages exist in Unicode, it is possible to correctly render the language with those defective text engines provided one prepares one's text correctly for them - by using precomposed characters.

And so, if every single one of the text engines with this problem for European languages still correctly rendered texts in non-European languages by composing characters for those languages, (in large part) the problem I am expressing concern about wouldn't exist.

But if that is not the case, then, basically, I'm taking this view (which, of course, is getting less and less correct with every passing day)... that essentially speakers of non-European languages often find themselves still "in the 1990s" even in the present day, what with living in poor countries, they have less access to fancy computers and have to make do with less powerful ones. Or because of software support issues, given that really cheap processors used in smartphones are powerful enough now to handle modern text rendering.

By the way, the reason I said "in large part" above is that there are other issues. When text starts to include codes for accent marks - and, worse yet, non-printing control characters which are used to correctly position things like accent marks - text itself becomes a more complex entity.

So writing a program that processes text, if it is to support such a language, becomes more complicated than writing a similar program that supports only English. I view that as bad; I want people to be able to program computers as easily as was the case back in the 1960s.

Of course, I can't do anything about the fact that Chinese characters have to be more than one byte long; while under some circumstances, using UTF-8 is not an issue, the option of using a plain 16-bit character encoding is sometimes required when such languages are supported. (Yes, I am saying that because, back in the 1960s, some programmers were able to assume that every character occupied a memory cell of exactly the same size, and text documents and character strings would contain no other kind of entity, it must be possible for programmers to still do this if they want even though they're supporting Chinese - basically because not every computer program is a commercial application that will sell thousands of copies, a program can also be a one-off thing to solve a single problem.)

John Savard · February 28

Simon Cozens said:

If the Unicode consortium isn't interested, then the countries affected adversely by this should get together, and define something new.

Include obligatory XKCD cartoon here.

could just switch

In the immortal words of Inigo Montoya, I do not think that means what you think it means.

I presume you mean this one:

https://xkcd.com/927/

This, of course, brings up another question: what would Equicode look like? To be "equal", it might be more than just a slight modification of Unicode.

In fact, it might look like this:

E+0000 to E+001F

The basic ASCII control characters

E+0020 to E+01FF

National use positions

E+0200 to E+021F

Unused (reserved)... the currency symbols block could wind up here

E+0220 to E+02FF

The part of ISO 8859-1 after the control characters

However, the "space" character may be moved from E+0220 to E+0020 in the process of defining the standard.

And then...

E+0300 to E+7FFF

Special characters, mathematical, programming, and so on

E+8000 to E+FDFF

Support for other languages

The dividing line might not be at E+8000; I'm not sure offhand how much space Unicode reserves for special characters.

Peter Constable · March 2

After watching the presentation, I haven't seen anything from the problems reported that is clearly a problem in current encoding of Greek script in Unicode.

As for the suggestion in the presentation that the encoding of Greek in Unicode was due to US-based companies taking a it's-good-enough approach, I'd point out a couple of things:

1) The encoding of Greek in Unicode was based (in part) on pre-existing ISO/IEC standards: ISO 5428:1984 Greek alphabet coded character set for bibliographic information interchange (first edition was published 1980); and ISO/IEC 8859-7:1987 Information processing—8-bit single-byte coded graphic character sets, Part 7: Latin/Greek alphabet.

2) The Greek national standards body participated in the development of the aforementioned ISO/IEC standards and in the first edition of ISO/IEC 10646, which is what determined the Unicode encoding of all characters used for modern Greek; and they are still, today, a participating member in the same ISO/IEC committee that partners with Unicode.

I'd also mention that some of the authors of the presentation have had contact for a very long time with people directly involved in maintenance of Unicode, so I would have thought they'd have known for years how to engage on Unicode encoding issues.

I'd also like to clarify something about Unicode normalization: Unicode normalization is not something that is necessarily done on text, and in general it is done only in certain contexts involving identifier systems. (It may be used in text searches, but in a temporary buffer, not changing the characters in content.) So, while it was mentioned for some of the issues that the Greek character entered from a keyboard gets normalized to a non-Greek character (e.g., 0387 ANO TELEIA gets normalized to 00B7 MIDDLE DOT), I think it's unlikely that the input methods are applying Unicode normalization, and more likely that they keyboard layouts are not providing the Greek character at all.

(Windows has a few different Greek keyboard layouts. The Greek 220 layout has a key that generates 037E GREEK QUESTION MARK, but unfortunately the other layouts—including the default layout—generates 003B SEMICOLON; and not appear to support 0387 ANO TELEIA.)

So, I think several of the problems mentioned have to do with keyboard layouts.

Peter Constable · March 2

Off topic (not about Greek)...

John Savard said:

...
Apparently Safari fails to render Unicode-compliant Mongolian-language fonts correctly.
...
If the Unicode consortium isn't interested, then the countries affected adversely by this should get together...

Sorry, but there are uniformed conjectures here. It may be true that Mongolian doesn't work well in some apps, but that's largely because the encoding was designed by certain linguists who were more interested in an encoding that reflects the historical linguistics of Mongolian than in something that is practical for implementers and end users, and that encoding was brought into ISO/IEC 10646 by the national standards body of their country. The design was so impractical that no implementer knew what "Unicode-compliant" or inter-operable actually meant in the case of Mongolian script. (The problem is far, far worse than anything mentioned for Greek.)

The Unicode Consortium actually was open to changes in Mongolian encoding to address some of the problems for users and implementers. However, the same language experts who designed the current encoding were strongly opposed to changes, and as a result only minor improvements could be made. Since fundamental improvements couldn't be achieved, Unicode has done what it can by pointing implementers at guidance for shaping engine and font developers that at least provide a basis for creating inter-operable implementations (i.e., so given Mongolian text content can display the same — i.e., with the same reading — in different apps or with different fonts).

Peter Constable · March 2

John Hudson said:

The Greek question mark is an interesting case. This also has a canonical decomposition, from U+037E GREEK QUESTION MARK to U+003B SEMICOLON.

The ATD3 presentation and the petition both suggest that this is a problem ...

But I am also wondering what a distinct form of Greek question mark would look like? When does it not have the same shape as the semicolon?

Good question. There were a few points in the ATD3 presentation at which I had a similar question: they'd say something looked wrong, but not clarify what would have looked right.

John Savard · March 2

Peter Constable said:

Off topic (not about Greek)...

John Savard said:

...
Apparently Safari fails to render Unicode-compliant Mongolian-language fonts correctly.
...
If the Unicode consortium isn't interested, then the countries affected adversely by this should get together...

Sorry, but there are uniformed conjectures here. It may be true that Mongolian doesn't work well in some apps, but that's largely because the encoding was designed by certain linguists who were more interested in an encoding that reflects the historical linguistics of Mongolian than in something that is practical for implementers and end users, and that encoding was brought into ISO/IEC 10646 by the national standards body of their country. The design was so impractical that no implementer knew what "Unicode-compliant" or inter-operable actually meant in the case of Mongolian script. (The problem is far, far worse than anything mentioned for Greek.)

The Unicode Consortium actually was open to changes in Mongolian encoding to address some of the problems for users and implementers. However, the same language experts who designed the current encoding were strongly opposed to changes, and as a result only minor improvements could be made. Since fundamental improvements couldn't be achieved, Unicode has done what it can by pointing implementers at guidance for shaping engine and font developers that at least provide a basis for creating inter-operable implementations (i.e., so given Mongolian text content can display the same — i.e., with the same reading — in different apps or with different fonts).

I will forgive the maker of the YouTube video for not knowing where the fault truly lies in all this, rather than raging at her for making me look foolish. Since Unicode, for very good reason, has a rule that characters can't be removed or redefined, this is definitely a serious issue.

Another thing she noted was that current Mongolian orthography is a mess; an older version of the Mongolian script would be a nearly phonetic script for the language, but it would be too inconvenient to change to it.

My attitude to this issue is based on the situation with Burma and Vietnam; unofficial local standards exist which make handling the languages of those countries genuinely practical, but the Unicode Consortium wants them to do stuff the hard way. That doesn't seem fair. But I realize this is a complex issue, and that there's a great deal of validity to the arguments on the opposing side; if that hadn't been the case, Unicode would never have gone down that route.

Apparently, the remedy for Mongolian lies in a different direction. I suppose we may have to wait for the emergence of a software industry within Mongolia itself for there to be someone with the will and ability to suggest genuine solutions and have the opportunity to have them heard.

John Hudson · March 2

So, while it was mentioned for some of the issues that the Greek character entered from a keyboard gets normalized to a non-Greek character (e.g., 0387 ANO TELEIA gets normalized to 00B7 MIDDLE DOT), I think it's unlikely that the input methods are applying Unicode normalization, and more likely that they keyboard layouts are not providing the Greek character at all.

There may be a distinction to be made between Unicode normalisation as a specific operation defined by Unicode, and presumptions around Unicode-defined decompositions applied elsewhere in text input, processing and display operations by software makers, including but not necessarily limited to keyboard layouts.

Digital Greek Typography is Broken

Comments

Categories