Disabling ccmp by default.

I thought some of the discussion during Behdad's YouTube statement 'Font Wars 2.0: The Untold Story' was worth capturing.  I have added additional information.

It was stated by Adam Twardoch that both rvrn and ccmp are supposed to be "on by default" and processed "early", before script shaping.

It was then pointed out that ccmp processing is disabled by default for the default script in pre-HarfBuzz MS Edge, which messes up fonts that use it to achieve some fundamental shaping, so that they work as well as the font permits.  There is a Tai Tham example available at http://wrdingham.co.uk/lanna/renderer_test.htm, with the font "A Tai Tham KH".  Select the font to see the examples.  (I think modern IE also shows the same pre-HarfBuzz behaviour.)  It should be noted that the non-application of ccmp pertains to 'non-standard scripts'; standard scripts may behave differently.  Please note that this font has since been updated beyond what is displayed on the site and that the font was targeted at Tai Khün, and apparently not at Northern Thai or Pali.  It has some pleasingly clever behaviour that depends on Tai Khün spelling rules for its implementation.

Adam Twardoch further noted that executing "ccmp" is essential to some functionality later on in the font (like contextual alternates); if ccmp is not executed, the rest of the shaping does not work.  ccmp was intended as "after Unicode-to-simple Glyphs cmap resolution, do some 'splitting' and 'joining' so that a different run of glyphs can go into further features". There was a "promise" (intention) that ccmp would be on by default. But it seems old Edge turned it off for speed reasons, like Chrome not rendering SVG :), so there was no reliable "early, on by default" feature and thus rvrn was added.
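
To illustrate the dependency on ccmp, here is a minimal Python sketch of a toy shaping pipeline.  Everything in it is hypothetical: the glyph names, the CMAP, CCMP and CALT tables and the shape function are invented for illustration and do not model any real font or shaping engine.  It only shows that when the early ccmp step is skipped, a later substitution that expects ccmp's output never matches.

```python
# Toy model: each "feature" is a list of (input glyphs, output glyphs) rules.
# All glyph names and rules below are hypothetical.

# Character-to-glyph mapping (stand-in for the font's cmap).
CMAP = {"a": "a", "\u0301": "acutecomb", "b": "b"}

# ccmp: "joining" done straight after cmap resolution.
CCMP = [(("a", "acutecomb"), ("aacute",))]

# calt: a later contextual rule that only knows the composed glyph.
CALT = [(("aacute", "b"), ("aacute", "b.alt"))]


def apply_rules(glyphs, rules):
    """Apply the first matching rule at each position, left to right."""
    out = []
    i = 0
    while i < len(glyphs):
        for src, dst in rules:
            if tuple(glyphs[i:i + len(src)]) == src:
                out.extend(dst)
                i += len(src)
                break
        else:
            # No rule matched here: copy the glyph through unchanged.
            out.append(glyphs[i])
            i += 1
    return out


def shape(text, ccmp_enabled=True):
    """Map characters to glyphs, optionally run ccmp, then run calt."""
    glyphs = [CMAP[ch] for ch in text]
    if ccmp_enabled:
        glyphs = apply_rules(glyphs, CCMP)
    return apply_rules(glyphs, CALT)


print(shape("a\u0301b"))                      # ccmp on: calt sees "aacute"
print(shape("a\u0301b", ccmp_enabled=False))  # ccmp off: calt never matches
```

With ccmp skipped, the glyph run reaches the later rule still decomposed, so the alternate for "b" is never selected.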

When ccmp is disabled by default, so also are liga and calt.  However, the fonts that use ccmp for basic shaping usually don't make use of downstream substitution features.  These fonts are generally written on the principle, 'Give me access to a substitution feature, and I will define the shaping.'

Comments

  • Peter Constable
    edited September 2020
    I would find it definitely surprising if ccmp were disabled in any shaping engine on any platform. Some applications would by-pass shaping for performance reasons if certain heuristics were true—e.g., a script (like Latin or CJK) for which shaping isn't absolutely essential, no discretionary features applied... But when text is shaped, ccmp should always be processed, and processed very early.
  • rvrn is fine as a bucket for variation-specific substitutions that one wants to be processed early, to set up input for subsequent GSUB. The alternative is to put all the early variation-specific lookups into ccmp, but that begins to feel like overloading in a layout model that ostensibly defines features by function. [And I'm not saying that is a necessary model: a huge number of OTL features are really redundant if one approaches layout using a generalised model based on lookup ordering.]

    The problem with rvrn is how it has been communicated to type designers and the incorrect impression that some people have that variations-specific substitutions must go in the rvrn feature. Variations-specific lookups can go into any feature. Of course, tracking GSUB output and input across a complex multi-dimensional design space can get really complicated pretty quickly, and an argument can be made for doing as much as possible either as early as possible or as late as possible.
  • I see another problem with rvrn.  At what stage is it executed for a complex script?  I looked through the USE specification, and found no mention of it.  That implies to me that it would be executed in the final GSUB stage along with the like of blws, which is in direct contradiction of the feature description, which implies it should be executed before any other shaping is done.

  • I would find it definitely surprising if ccmp were disabled in any shaping engine on any platform. Some applications would by-pass shaping for performance reasons if certain heuristics were true—e.g., a script (like Latin or CJK) for which shaping isn't absolutely essential, no discretionary features applied...
    That may be the historical root of the problem.  If one is to render Latin text that is not in form NFC, one needs shaping to implement soft-dottedness.  (NFC makes it harder to come up with uncontrived examples of need - CLDR 'ISO 11940' transliteration is the most obvious with ได้ > dị̂.) Consequently, ccmp seems to have been enabled for 'standard scripts', but I haven't done a detailed study.  At one stage, I then found that for at least some combinations of Windows and browser using native rendering, ccmp was enabled by default for Latin but could not be enabled for Tai Tham (at least, not from HTML)!  This is still the situation for Windows 7 and IE 11.  It seems that as Tai Tham requires reordering but Windows 7 does not provide it, Tai Tham is not allowed shaping by substitution!

    I forget the precise situation with Windows 10 with IE 11 and natively rendering MS Edge - I think A Tai Tham KH worked in IE11 when ccmp was explicitly initialised by the font definition in the HTML, but with MS Edge one got reordering both from the renderer and the font, which works badly for connected text.  Unfortunately, I don't currently have Windows 10 with access to my pages on the Internet to test it out, let alone a version of MS Edge with Windows shaping.

  • I see another problem with rvrn.  At what stage is it executed for a complex script?  I looked through the USE specification, and found no mention of it.  That implies to me that it would be executed in the final GSUB stage along with the like of blws...
    rvrn was defined after the USE spec was published, and the latter has not been updated yet. Whether USE implementations have been updated is a question for the three implementers.
  • Confirmation from HarfBuzz and Apple that, yes, rvrn is processed up front in their USE implementation (actually, before the USE spec processing), and I'm guessing the same is true in Microsoft's, since they're the ones who wanted rvrn in the first place.
  • When you know what you're looking for, the HarfBuzz source code is actually not too difficult to read on this sort of thing.

    Look for "hb_ot_shape_collect_features" in this file; that is the general-purpose shaping engine and the features that it processes, in order (with pauses between blocks). Then between the magic "HARF" and "BUZZ" features you have the complex-shaper-specific processing. For e.g. USE, you look for "collect_features_use" in the USE shaper implementation.

    At any rate, you can easily confirm that the general-purpose shaping engine runs rvrn before anything else, in all cases.
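
The ordering just described can be pictured with a small Python sketch.  This is a toy model and not HarfBuzz's code: the function names collect_feature_stages and runs_before, and the feature lists passed in, are hypothetical.  Only the shape of the ordering (rvrn first, then a pause, with the script-specific shaper's features sitting between the "HARF" and "BUZZ" markers) is taken from the description in this thread.

```python
PAUSE = "<pause>"  # stand-in for a pause between blocks of features


def collect_feature_stages(early_general, shaper_specific, late_general):
    """Return an ordered plan of feature tags (toy model, see lead-in).

    The three arguments are illustrative lists of feature tags; which
    features really belong in which block is up to the shaping engine.
    """
    plan = ["rvrn", PAUSE]          # rvrn runs before anything else
    plan += list(early_general)     # general-purpose features
    plan += ["HARF"]                # marker: script-specific block starts
    plan += list(shaper_specific)   # e.g. what a USE-like shaper would add
    plan += ["BUZZ"]                # marker: script-specific block ends
    plan += list(late_general)      # remaining general-purpose features
    return plan


def runs_before(plan, first, second):
    """True if `first` appears earlier in the plan than `second`."""
    return plan.index(first) < plan.index(second)


# Illustrative call: 'blws' stands in for a script-specific feature.
plan = collect_feature_stages([], ["blws"], ["liga"])
print(plan)
print(runs_before(plan, "rvrn", "blws"))
```

In this model, whatever the script-specific shaper contributes, rvrn has already been applied by the time its block begins.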
  • So that looks like an issue to raise against over a dozen Microsoft script-specific specifications.  Or is there a more central point to raise the matter?

    One worry with the Microsoft engines is that they probably need to be changed individually.
  • One of the points that contacts at both Apple and HarfBuzz made is that rvrn is processed before strings are passed to USE for shaping. That may also be the case in the Microsoft implementation. So rather than thinking of all script shaping specifications needing to be updated, what is really needed is a higher level specification that captures all the stages of text processing from script itemisation and run segmentation through a multi-stage OTL processing model, of which script shaping is only one part.
  • RichardW said:

    If one is to render Latin text that is not in form NFC, one needs shaping to implement soft-dottedness... 

    This is still the situation for Windows 7 and IE 11.  It seems that as Tai Tham requires reordering but Windows 7 does not provide it, Tai Tham is not allowed shaping by substitution!

    I forget the precise situation with Windows 10...


    If an implementation is trying to make perf enhancements that avoid shaping, I would expect any combining mark in a string to kick into a shaping path.

    Windows 7 certainly did not support Tai Tham. I can't speak to Windows 10: it would be implemented by USE, which I never worked on.
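
As a rough illustration of the heuristic of letting any combining mark force the shaping path, here is a minimal Python sketch.  The function name needs_shaping is hypothetical, and treating Unicode general categories Mn, Mc and Me as "combining mark" is an assumption made for this example; it does not describe what any particular implementation actually checks.

```python
import unicodedata


def needs_shaping(text: str) -> bool:
    """Return True if any character is a mark (category Mn, Mc or Me).

    Toy heuristic: a run with no marks could take a fast path that
    by-passes shaping; a run with at least one mark could not.
    """
    return any(unicodedata.category(ch).startswith("M") for ch in text)


print(needs_shaping("distance"))        # plain ASCII letters only
print(needs_shaping("di\u0323\u0302"))  # i + dot below + circumflex
```

A check of this kind only looks at marks; it says nothing about scripts that need shaping for other reasons.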
  • Tai Tham was intended to be shaped by USE, but there are issues in Tai Tham that break the USE cluster model. This is a known issue that has been discussed by script experts and implementers several times over the past few years but, to my knowledge no action has yet been taken.
  • Does ccmp now get applied to the PUA in the main purely Microsoft systems?

    I'll answer the remarks on the rendering of Tai Tham, though that's really a separate matter.  There's slow progress.  HarfBuzz now allows 'work' (ᨠᩣ᩠ᩁ, CVC) but still not 'reason' (ᩉᩮ᩠ᨲᩩ, CVCV).  It all takes time (often ᩅᩮ᩠ᩃᩣ, also CVCV). We're now almost up to the point that should have been reached if Khmer had been taken into account when devising the USE.  Peter Constable's request for examples of the different types of Tai Tham akshara was in vain, for they were just ignored when the USE was designed.  The USE should also have been compatible with Myanmar.  Oh, and it really needs the above and below concession granted to Tibetan for non-normalising renderers, but there seems to be a successful conspiracy to monopolise the manufacture of keyboard mappings in progress.

    (The northern merger of a vowel above and a tone mark causes trouble if the competing characters are not to be stored in the same place.  The encoding proposals happily allowed them to be stored in the same place.)

    The other noticeable issue is that the USE misunderstands the concept of 'medial consonant' - which will largely be solved by treating the medial consonants as subscript consonants.  It may be necessary to deprecate MEDIAL RA - <SAKOT, RA> could be used instead, as there is strong opposition to allowing phonetic order for the word that justifies their disunification, ᨯᩪᩕᩣ /duːhaː/ 'pay attention to me'.  It looks as though it were spelt <druːːaː>, but if we must spell it in that fashion, shaping rules can sort out the rendering without resorting to MEDIAL RA.

    You, John Hudson, implied that there was some need for the USE to have a 'deterministic regular expression'.  Are you confident of that, because Andrew Glass was not aware of any such constraint when we discussed the matter, and the South East Asian shaper seemed to be working well without one before it was murdered to make way for the USE.  What did you mean by a 'deterministic regular expression'?  There seem to be several such similar but distinct concepts in the mathematical literature.

    I note that having added U+103D MYANMAR CONSONANT SIGN MEDIAL WA to allow a distinction from Sanskrit <VIRAMA, MYANMAR LETTER WA>, renderers are now refusing to render the latter!  (The distinction is that one may be vaguely triangular, but the other may not.)




  • You, John Hudson, implied that there was some need for the USE to have a 'deterministic regular expression'.  Are you confident of that, because Andrew Glass was not aware of any such constraint when we discussed the matter, and the South East Asian shaper seemed to be working well without one before it was murdered to make way for the USE.  What did you mean by a 'deterministic regular expression'?  There seem to be several such similar but distinct concepts in the mathematical literature.

    Where are you quoting that statement from? I don't recall ever using the phrase 'deterministic regular expression', and I'm not sure what it means.
  • Sorry, I'm misremembering the 2016 Typo Labs presentation "Making fonts for the Universal Shaping Engine".  The phrase which you used was, "which uses subjoined consonants in ways that may compress multiple syllables into a single cluster, causing recursion in cluster analysis".  Iteration, of course, already occurs, so there's actually no reason for a problem given there.


  • At that time, I was aware that there was an issue with Tai Tham in USE, but I wasn't familiar with the details as it isn't a script with which I had any experience. So I asked Andrew Glass to explain to me how Tai Tham broke the USE cluster model in succinct terms that I could mention in my presentation. As far as I recall, the wording of those sentences was taken pretty much verbatim from his description. My understanding was that Tai Tham could effectively join two clusters into a single graphical unit, and USE wasn't designed to be able to analyse such a structure.
  • Yep, that's been the biggest problem.  It seems to go back to late Brahmi, and was quite prominent in Khmer (a survival is documented in TUS) and Lao, where U+0EBD LAO SEMIVOWEL SIGN NYO survived the first wave of reforms.