Digital Greek Typography is Broken

… says this petition:

https://www.openpetition.eu/gr/petition/online/digital-greek-typography-is-broken-improve-standards-and-demand-fixes-in-all-software

I certainly buy that there are many problems, but some things seem to be getting blamed on Unicode that are actually to do with downstream implementations.

Comments

  • John Hudson
    John Hudson Posts: 3,317
    edited February 21
    From that openpetition page:
    1. Prioritize a comprehensive review and revision of Unicode definitions and normalization rules for Greek, in consultation with native Greek speakers and experts in Greek typography.
    Very few of the software problems cited in the petition are due to Unicode normalisation per se, and Unicode normalisation cannot be changed due to stability agreements between standards bodies, so this seems both a distraction and a non-starter.
  • In 2002 we had a similar problem in Bangla digital typography with Khanda Ta and Ya-phalaa. Initially the Unicode people said the specification was fine and the problem lay in implementation. Later, when implementing the existing specification proved difficult, the Unicode Consortium relented and changed the specification, encoding Khanda Ta as a separate character. Luckily there were not many backward-compatibility issues then. But for Greek, any change to the specification now would have big compatibility issues.
  • With regard to encoding, Unicode can ADD new characters. They can even deprecate existing characters… but never remove them from the specification altogether.

    But most of the Greek issues with encoding seem to be about apps (or even fonts) doing things incorrectly, when enough correct and distinct codepoints already exist.
  • Denis Moyogo Jacquerye
    edited February 20
    The ATD3 presentation is online: https://vimeo.com/1058500542#t=63:12
  • This initiative may be helpful in raising awareness of those issues, but I doubt that this petition alone will lead to much improvement. Foremost, I recommend precisely defining which problem arises in which scenario and, most importantly, distinguishing a) encoding issues from b) application issues, c) font bugs and d) keyboard issues. In the text of the petition, these four (!) aspects are mixed together too much. It won’t help to blame, e.g., Unicode for a bug in, let’s say, a specific font or in the source code of a language setting (→ hyphenation rules).
    As far as I can tell there are no (or no grave) bugs in Unicode for Greek. If someone thinks there are, they should file a proposal addressed to the UTC directly. However, as Thomas mentioned, the stability policy demands that existing encodings be kept unaltered.

    Only as a side note: the so-far unencoded Greek Omicron-Upsilon character (rather: glyph) may get encoded in the near future.

  • Kent Lew
    Kent Lew Posts: 968
    I saw the presentation in the livestream, but want to watch it again sometime as the slides went by very quickly, and some of the cited problems deserve closer examination.
    I am sympathetic to the frustrations and support advocacy for solutions; but I also agree that the ire seems misdirected at Unicode. The use of the term “normalization” for a broad array of issues further confuses the matter, since it has a specific meaning in the context of Unicode that, as far as I can tell, is not always related to the observed problems and may not align with what is actually being complained about.
    In line with Andreas’s comment, the technical locus of each issue needs to be further pinpointed in order to direct appeals and apply pressure where it is most likely to yield real, practical results.
    One issue that may be somewhat addressable by font makers is the problem of mixed fallbacks, where a π or µ from one [usually non-Greek] font is interspersed with true Greek setting from a fallback font.
    And in this instance, Unicode may be somewhat culpable. I think the handling of Omega/Ohm, Delta/increment, mu/micro, and especially lack of equivalent non-Greek pairing for pi has played a role.
    The problem seems to stem from the fact that most fonts include Ω∆πµ for scientific and other non-Greek applications, not designed in a full Greek context, and yet these are encoded as Greek rather than as their non-Greek equivalents. The reason for this, I believe, has much to do with legacy codepages and with keyboard implementations, but it also seems to trace back to Unicode determinations of equivalence.
    Font makers might do some small service by not including these four in fonts that do not otherwise support Greek (at least for those destined for web use or other environments that implement fallback stacks). That way, Greek text might fall back to some consistent representation.
    This would mean that any non-Greek reference would also have to fall back, which might not be pretty. The glyphs could still be included with their non-Greek codepoints, but most non-Greek keyboards have unfortunately chosen to input the Greek codepoint (due to Unicode equivalence, I imagine), which is part of the problem and perhaps also a legitimate issue.
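The mixed-fallback mechanism Kent describes can be sketched as a toy character-by-character font stack resolution. This is a hypothetical simulation — the font names and coverage sets are invented for illustration, not taken from any real environment:

```python
# Hypothetical sketch of browser-style font fallback producing the
# mixed-font Greek display described above. Fonts and coverage invented.

GREEK_SYMBOLS = set("Ω∆πµ")  # Greek-encoded symbols commonly found in Latin fonts

fonts = {
    # A Latin font that nonetheless maps the four Greek-encoded symbols:
    "LatinSans": set("abcdefghijklmnopqrstuvwxyz ") | GREEK_SYMBOLS,
    # A fallback font with real Greek support:
    "GreekSerif": set("αβγδεζηθικλμνξοπρστυφχψω "),
}

def resolve(text, stack):
    """Return (char, font) pairs: each character is drawn from the first
    font in the stack whose character map covers it."""
    out = []
    for ch in text:
        for name in stack:
            if ch in fonts[name]:
                out.append((ch, name))
                break
        else:
            out.append((ch, None))  # no coverage anywhere: .notdef / tofu
    return out

run = resolve("πριν", ["LatinSans", "GreekSerif"])
# π is taken from LatinSans (it maps U+03C0), the remaining letters from
# GreekSerif, so a symbol-styled π is interspersed with true Greek setting.
```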
  • John Hudson
    John Hudson Posts: 3,317
    edited February 20
    The Unicode encoding of Greek is messy and certainly not as simple or easy to implement as it should have been. As Kent notes, it includes some equivalences between Greek letters and Greek-derived symbols that have been a problem because of choices made in implementations, notably in 8-bit codepages and the long tail of inherited dependencies. I suspect any ‘review and revision of Unicode definitions and normalization’ and establishment of ‘a standardized character set for digital Greek that supports all necessary characters, diacritics, and typographic conventions’ is going to result simply in a list of ‘Use these characters’ and ’Don’t use these characters’.

    The petition mentions ‘incorrect case conversion’. Greek casing is complicated by factors resulting from modern Greek typographic practice: some precomposed diacritic characters exist only as lowercase because they do not occur in word-initial position and in all-caps would, conventionally, lose their diacritic marks. These encodings require one-to-many, precomposed-to-base+mark conversions. Some aspects of casing—notably contextual behaviour of mark suppression in all-caps—are put onto font makers to handle at the glyph substitution level, but I think that is necessary because the suppression of marks in all-caps settings is a modern convention of Greek typography and not a consistent practice of the script. It would break correct display of many centuries of Greek text if that aspect of casing were applied at the character level in software case conversion implementations.
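Both halves of this point are observable in Python, whose str casing follows Unicode's full case mappings: the one-to-many precomposed-to-base+mark conversion does happen at the character level, while all-caps mark suppression does not:

```python
# U+0390 (ΐ, iota with dialytika and tonos) has no precomposed uppercase:
# Unicode's SpecialCasing expands it one-to-many on uppercasing.
up = "\u0390".upper()
assert len(up) == 3
assert [hex(ord(c)) for c in up] == ["0x399", "0x308", "0x301"]  # Ι + dialytika + tonos

# By contrast, mark suppression in all-caps is NOT a character-level rule:
# ά (U+03AC) uppercases to Ά (U+0386) with the tonos intact. Stripping it
# is a typographic convention applied, if at all, at the glyph level.
assert "\u03ac".upper() == "\u0386"
```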

    I would say that Greek, as encoded in Unicode and needing to be supported at the glyph substitution level to effect correct display, meets the definition of a ‘complex script’. I suspect some of the frustration arises from the assumption—not only on the part of users but also on the part of implementing software makers—that Greek must be a simple script because it is European and alphabetical, rather than a Middle Eastern cursive abjad or an Indic alphasyllabary.
  • Thomas Phinney
    Thomas Phinney Posts: 2,945
    edited February 20
    For the symbols vs Greek problem, seems like one reasonable solution would be:

    - fonts with full Greek support put the math/tech symbol characters at the correct codepoints (already happening)
    - fonts with the symbols but not the corresponding Greek put each symbol at BOTH the symbol codepoint and the Greek codepoint (a revision required), so it works in legacy environments as well as future environments
    - Most importantly, future apps/environments use the correct codepoints for the symbols, and fall back to the Greek codepoint only if the symbol codepoints are missing in a given font

    One could add an optional flag to the OpenType format that, if “on”, would indicate “hey, I use the proper symbol codepoints for those Greek-like symbols”; but that would only work in a future app/environment that looked for it, in which case it could do the fallback anyway.
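The proposed lookup order — prefer the dedicated symbol codepoint, fall back to the Greek codepoint only when the font lacks it — can be sketched as follows. The codepoint pairings are real Unicode characters; the fonts and the resolver function are hypothetical illustrations:

```python
# Greek-encoded symbol -> dedicated symbol codepoint, per the scheme above.
# The pairings are real Unicode codepoints; the fonts are invented.
SYMBOL_FOR_GREEK = {
    "\u03a9": "\u2126",  # Ω GREEK CAPITAL OMEGA  -> OHM SIGN
    "\u0394": "\u2206",  # Δ GREEK CAPITAL DELTA  -> INCREMENT
    "\u03bc": "\u00b5",  # μ GREEK SMALL MU       -> MICRO SIGN
    # π has no dedicated non-Greek symbol codepoint, as noted above.
}

def pick_codepoint(ch, cmap):
    """In a symbol (non-Greek) context, use the symbol codepoint when the
    font's character map has it; otherwise fall back to the Greek one."""
    sym = SYMBOL_FOR_GREEK.get(ch)
    if sym is not None and sym in cmap:
        return sym
    return ch

modern_font = {"\u2126", "\u2206", "\u00b5"}  # has the symbol codepoints
legacy_font = {"\u03a9", "\u0394", "\u03bc"}  # Greek codepoints only

assert pick_codepoint("\u03bc", modern_font) == "\u00b5"
assert pick_codepoint("\u03bc", legacy_font) == "\u03bc"
```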
  • John Hudson
    John Hudson Posts: 3,317
    Thomas, some of the Greek-derived symbol characters have compatibility decompositions to Greek letter characters, i.e. one-directional decompositions that are not standard normalizations but may be applied. These reflect compatibilities in old 8-bit character sets, where e.g. Greek μ and the micro symbol used the same decimal encoding. These compatibility decompositions have long tail dependencies in software, and since they’re not wrong from a Unicode standardization perspective, there isn’t any impetus for software makers to track down those inherited dependencies and change them. I think compatibility decompositions—unlike canonical decompositions—may not be subject to stability agreements, so could perhaps be changed. Having a clean encoding distinction between Greek letters and Greek-derived symbols would be helpful; of course, it doesn’t guarantee that anyone is going to go and clean up existing code bases.
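The distinction drawn here is directly observable with Python's unicodedata module: the micro sign's decomposition to μ is compatibility-only (applied by NFKC but left alone by NFC), whereas the ohm sign's decomposition to Ω is canonical (applied even by NFC):

```python
import unicodedata

micro, mu = "\u00b5", "\u03bc"   # MICRO SIGN vs GREEK SMALL LETTER MU
ohm, omega = "\u2126", "\u03a9"  # OHM SIGN vs GREEK CAPITAL LETTER OMEGA

# Compatibility decomposition: only the K forms fold micro into mu.
assert unicodedata.normalize("NFC", micro) == micro
assert unicodedata.normalize("NFKC", micro) == mu

# Canonical decomposition: even NFC folds the ohm sign into omega,
# because singleton decompositions are excluded from recomposition.
assert unicodedata.normalize("NFC", ohm) == omega
```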
  • John Hudson
    John Hudson Posts: 3,317
    edited February 20
    The ano teleia normalization problem is a bad one. U+0387 GREEK ANO TELEIA has a canonical decomposition to U+00B7 MIDDLE DOT, which is entirely inappropriate because it conventionally sits too low to be used as an ano teleia. Because this is a canonical decomposition, it cannot be changed in Unicode. So there is always a chance that U+0387 is going to be converted to U+00B7. The issue can be addressed by a grek script locl substitution

    sub periodcentered by anoteleia;
    (followed by case and smcp substitutions of appropriate height variants for all-caps and small-cap ano teleia), but that means an actual middle dot is unavailable for Greek text, which might be an issue for e.g. transcribing coin or seal inscriptions.
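Because the canonical decomposition of U+0387 is a singleton, even NFC — not just the compatibility forms — silently rewrites ano teleia as a middle dot, which Python's unicodedata confirms:

```python
import unicodedata

ano_teleia, middle_dot = "\u0387", "\u00b7"

# Singleton canonical decompositions are excluded from recomposition, so
# every normalization form, including NFC, converts U+0387 to U+00B7.
for form in ("NFC", "NFD", "NFKC", "NFKD"):
    assert unicodedata.normalize(form, ano_teleia) == middle_dot
```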

  • John Hudson
    John Hudson Posts: 3,317
    The Greek question mark is an interesting case. This also has a canonical decomposition, from U+037E GREEK QUESTION MARK to U+003B SEMICOLON.

    The ATD3 presentation and the petition both suggest that this is a problem because it prevents a distinct form being used for the Greek question mark. As with ano teleia, a grek locl glyph can be implemented, and should work so long as the script=common property of the semicolon means it is rolled into the adjacent Greek glyph run for OTL processing. And unlike ano teleia vs middle dot, I think there is no context in which the common semicolon might be used distinctively in Greek text.

    But I am also wondering what a distinct form of Greek question mark would look like? When does it not have the same shape as the semicolon?
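The same singleton-decomposition behaviour applies here: normalization rewrites U+037E as an ordinary semicolon, so the two characters cannot be distinguished in normalized text and any distinct question-mark shape would have to come from the font:

```python
import unicodedata

greek_qmark = "\u037e"  # GREEK QUESTION MARK

# U+037E canonically decomposes (as a singleton) to U+003B SEMICOLON, so
# all normalization forms collapse the Greek question mark to a semicolon.
assert unicodedata.normalize("NFC", greek_qmark) == ";"

# Consequence: normalized Greek text cannot record which character the
# author originally typed.
assert unicodedata.normalize("NFC", "πού" + greek_qmark) == "πού;"
```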
  • Kent Lew
    Kent Lew Posts: 968

    - fonts with the symbols but not the corresponding Greek put each symbol at BOTH the symbol codepoint and the Greek codepoint (a revision required), so it works in legacy environments as well as future environments

    Thomas, I think you miss my point. A font with the symbol at both codepoints but without the rest of the corresponding Greek is what I believe is causing the problems of mixed-font display: a specified font doesn’t support Greek, so the display falls back through the stack to one that does — but the specified font does contain the Ω∆µπ Greek codepoints, and so those originals get mixed in with the rest of the Greek from the fallback and don’t mesh well.

    If the font doesn’t support Greek, having the symbols only at non-Greek codepoints would solve that problem. Except that modern non-Greek keyboards input only the Greek codepoints in non-Greek settings, rather than the symbol codepoints, so most texts don’t properly distinguish the usage.

    Mathematicians will likely tell you that the symbols for use in math & science are the Greek letters. Thus the Unicode equivalence determinations. But that seems to be exactly what has led to this problem for the Greek users.

  • Kent Lew
    Kent Lew Posts: 968
    For the reasons you point out, John, the ire regarding ano teleia and the Greek question mark is rightly directed at Unicode, precisely because of those canonical determinations. Faulting “normalization” is warranted in these cases. But as you say, the canonicality probably means that ship has sailed and is unlikely to be called back to port.
  • Nick Shinn
    Nick Shinn Posts: 2,247
    edited February 20
    @John Hudson

    I suspect some of the frustration arises from the assumption—not only on the part of users but also on the part of implementing software makers—that Greek must be a simple script because it is European and alphabetical, rather than a Middle Eastern cursive abjad or an Indic alphasyllabary.

    Perhaps the complexities involved in typesetting Greek for Renaissance polyglot bibles, full of cursivity, could be a benchmark.
  • John Hudson
    John Hudson Posts: 3,317
    Regarding the incorrect display of some characters with symbol forms or fallback to other fonts: this is something that needs to be examined on a case-by-case basis, because I don’t think there is a single explanation for what is happening in all environments. This sort of thing is a legacy of 8-bit encodings, which were notoriously platform-specific, and then of the process by which those 8-bit encodings were translated—again, independently on different platforms—to ‘codepages’ of Unicode codepoints. Because, to save space, Greek letters and Greek-derived symbols shared decimal codepoints in 8-bit encodings, how they were interpreted in the translation to Unicode codepages was not consistent, and the mix of canonical and compatibility mappings and unifications in Unicode reflects that: some pairs of characters are always considered equivalent, some are equivalent only in some environments, and some remain unified on a single codepoint (notably the π letter/symbol).

    I think the messiness of this in part reflects Unicode, at a particular point, being more descriptive than prescriptive. Software companies had done things with Greek and were doing things with Greek, and Unicode was trying to capture that and apply some flexible rationale to cover the variety of things done. This is the case for a lot of the scripts encoded early in the history of the standard: none of this mess would pass muster in a script encoding proposal today.