Opting out of the Universal Shaping Engine

RichardW · July 2022

I'm not sure whether this is the best public place to ask; the man who could best advise on this implicit subquestion is probably @Peter Constable.

A sad feature of Tai Tham font development (though Apple may laugh) is that the best way of developing a Tai Tham font, though perhaps not for the faint of heart, is to avoid the Universal Shaping Engine (USE). With the HarfBuzz renderer, this can be done straightforwardly by having no script table in GSUB or GPOS for the script in question. The shaping and positioning are then done according to the features provided for the default script ('DFLT'). This also seems to work for CoreText.
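The fallback described above can be pictured with a small model. This is only a sketch of the behaviour reported in this post, not HarfBuzz's actual code: the function name and the two example fonts are hypothetical, and 'lana' is used as the registered OpenType script tag for Tai Tham.

```python
def choose_script_table(font_script_tags, run_script_tags):
    """Pick the script table a shaper might use for a run (simplified model).

    font_script_tags: set of script tags present in the font's GSUB/GPOS.
    run_script_tags: tags the shaper would look for, in priority order.
    Returns the chosen tag, or None if nothing usable is present.
    """
    for tag in run_script_tags:
        if tag in font_script_tags:
            return tag
    # Assumption: with no matching script table, fall back to 'DFLT'.
    if "DFLT" in font_script_tags:
        return "DFLT"
    return None


# Hypothetical font that deliberately omits the Tai Tham script table.
opt_out_font = {"DFLT", "latn"}
# Hypothetical font that declares Tai Tham explicitly.
use_font = {"DFLT", "lana"}

print(choose_script_table(opt_out_font, ["lana"]))  # falls back to DFLT
print(choose_script_table(use_font, ["lana"]))      # picks the Tai Tham table
```

In this model the font developer controls the outcome purely by which script tables are present, which is the opt-out mechanism being asked about.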

I am not clear what the behaviour is intended to be on Windows 10. It seems to have evolved over the course of Windows 10, and I would like to know what the long term intention is and what the timescales are. In Notepad and Character Map, it seems that for a font with no GSUB or GPOS script table for the script in question, the USE will be run using features for the default script. For MS Word in Microsoft Office Standard 2016, it seems that the USE will be run with all features considered devoid of lookups, except that features for the DFLT script can be enabled in the normal fashion for MS Word. (There are some anomalies I have not yet worked out.) It seems the dotted circle will not be inserted if not present in the font. Will the MS Word behaviour change to that of Notepad? Or will there be a mechanism to opt out of the USE?

What happens with other renderers?

John Hudson · July 2022

The first thing to note is that this is not specified anywhere.

I like the idea that the absence of a particular script tag applies DFLT script shaping to characters regardless of Unicode script property. This provides a mechanism to bypass the directed shaping path for the characters. It seems to me unproblematic, and something that should be standardised so that developers know to do this.

But is it really the case that this is the best option for Tai Tham? Isn’t the better option either to revise USE to include some cluster model exceptions for Tai Tham, or to specify and implement an independent Tai Tham shaping engine?

RichardW · July 2022

Of course one problem is that the Windows default for shaping seems to be to disable all shaping for a non-'standard' script unless it is under the heel of a shaping engine. However, it is possible that we have overlooked the feature rlig.

One problem is that Microsoft seems unwilling to fund the changes, and the changes HarfBuzz has made on its own are too timid. Also, the HarfBuzz USE code has been locking against it. The core change needed to make it work is to replace a basic core of C{HC} by a core of C(V){HC(V)}. I've been looking at the code changes needed, and they're looking dramatic, despite their conceptual simplicity. Further complexity arises if we try to stop them applying to other scripts. For example, having privileged SAKOT to behave like U+17D2 KHMER SIGN COENG, a behaviour always pointed out, it may be necessary to similarly privilege Tai Tham subscript consonants. (Unsurprisingly, Mon U+1039 MYANMAR SIGN VIRAMA also attaches to vowels in the natural order. The Tais got Tai Tham from the Mon, and then to support the Tai vernacular, enriched it with a descendant or close relative of the Sukhothai script.) There seems to be a failure to recognise that the dozen or so consonant signs of Tai Tham are basically special cases of HC or CH. We were greedy and wanted Tai Tham in the BMP. It might have been better to accept the SMP and go for the Tibetan model. HarfBuzz allows a subscript consonant to follow a vowel if it is encoded using SAKOT (the invisible stacker), but not if it is encoded using a single codepoint!
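To make the difference between the two cores concrete, here is a rough sketch using regular expressions over one-letter class strings. It assumes that {…} in the notation above means "zero or more repetitions" and (V) means "an optional vowel", with C = consonant, H = SAKOT (the stacker) and V = vowel. The real USE cluster grammar has many more classes, so this only illustrates the shape of the change, not the engine itself.

```python
import re

# Core as it stands, per the post: C{HC}
OLD_CORE = re.compile(r"C(?:HC)*")
# Proposed core, per the post: C(V){HC(V)}
NEW_CORE = re.compile(r"CV?(?:HCV?)*")


def fits(core, classes):
    """Return True if the whole class string is a single valid core."""
    return core.fullmatch(classes) is not None


# Consonant + SAKOT + consonant: fine under both cores.
print(fits(OLD_CORE, "CHC"), fits(NEW_CORE, "CHC"))
# Consonant + vowel + SAKOT + consonant (a subscript after a vowel):
# only the proposed core accepts it.
print(fits(OLD_CORE, "CVHC"), fits(NEW_CORE, "CVHC"))
```

Under these assumptions, a subscript consonant following a vowel is rejected by the old core and accepted by the new one, which is the case the proposed change is meant to cover.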

RichardW · July 2022

I've taken another look at the HarfBuzz code, and it's only the handling of the non-conformant clusters that gets complicated. That could be fixed in slower time, so it looks as though I can still roll my sleeves up and produce a demonstration version of how to do USE. HarfBuzz USE already has one special class for Tai Tham (and also a special class for dual purpose virama/vowel modifier), so adding a couple more won't hurt much. I forgot that the encoding proposals had MAI KANG as both vowel and final consonant, and I think something like that is the best way of handling the need for both the sequences <MAI KANG, SIGN AA> and <SIGN AA, MAI KANG>.

bdevos · August 2022

The issue with DirectWrite not following the shaper specified in the font occurs, if I understand Peter correctly, even if the primary script does not get routed to the USE. For the font in the GitHub issue, the scripts listed in the FEA source are DFLT and tml2. The nukta, which is in a different Unicode block, then ends up in a different script run with DirectWrite.

I have seen the original issue of a script (Limbu in this case) being routed to USE in DirectWrite, even if the font only has DFLT.

John Hudson · August 2022

Fonts do not specify shapers. Text path to shaping engine is not determined by either OTL script tag or Unicode block. It is determined by the Unicode script property for each character. So all characters with the property script=Limbu will be processed in runs that are passed to the Universal Shaping Engine.

The Grantha bindu below, which is used as a nukta with Tamil script, has the property script=Inherited, meaning that it should be combinable with any base character from any script. So it doesn’t matter that it is in a different Unicode block from Tamil, and the issue in DirectWrite is not architectural but just indicates that the Tamil shaping engine has not been updated to accommodate this character.

[Script itemisation and run segmentation is performed using unstandardised and/or undocumented algorithms. For characters with particular script properties, this process is pretty obvious, but the handling of adjacent characters with the property script=Common is not standardised.]
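A toy itemiser may help show why the Unicode block is irrelevant here. This is a sketch under stated assumptions, not any platform's algorithm: the script values are a hand-copied excerpt rather than the full Unicode Character Database, the "Grantha bindu below" is assumed to be U+1133B COMBINING BINDU BELOW, and Inherited characters simply take the script of the preceding character.

```python
# Hand-written excerpt of Unicode script property values (assumption:
# a real itemiser would load these from the UCD's Scripts.txt).
SCRIPT = {
    0x0B95: "Tamil",       # TAMIL LETTER KA
    0x0BBE: "Tamil",       # TAMIL VOWEL SIGN AA
    0x1133B: "Inherited",  # COMBINING BINDU BELOW (in the Grantha block)
    0x1900: "Limbu",       # LIMBU VOWEL-CARRIER LETTER
    0x1901: "Limbu",       # LIMBU LETTER KA
}


def itemise(text):
    """Split text into (script, substring) runs by script property."""
    runs = []
    for ch in text:
        script = SCRIPT.get(ord(ch), "Common")
        # Inherited characters join the run of the character before them.
        if script == "Inherited" and runs:
            script = runs[-1][0]
        if runs and runs[-1][0] == script:
            runs[-1] = (script, runs[-1][1] + ch)
        else:
            runs.append((script, ch))
    return runs


# Tamil KA + bindu below + Tamil AA stays in one Tamil run,
# even though the bindu is encoded in a different block.
print(itemise("\u0B95\U0001133B\u0BBE"))
# Tamil KA followed by Limbu KA splits into two runs.
print(itemise("\u0B95\u1901"))
```

How Common characters are attached to neighbouring runs is deliberately left out, since that is exactly the part noted above as unstandardised.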

RichardW · August 2022

John Hudson said:

Fonts do not specify shapers. Text path to shaping engine is not determined by either OTL script tag or Unicode block. It is determined by the Unicode script property for each character. So all characters with the property script=Limbu will be processed in runs that are passed to the Universal Shaping Engine.

Actually, in many cases they do. For a Devanagari script run, HarfBuzz offers at least a choice of V1 Indic shaping engine (script tag "deva"), V2 Indic engine (script tag "dev2") and USE (script tag "dev3"). Likewise, for Tai Tham, it offers a choice of USE (script tag "lana") and entirely user-defined (script tag "DFLT"), and similarly for Limbu. The Windows system is more restrictive - it seems that of these three scripts, only for Devanagari is there a choice, between Versions 1 and 2 of the Indic shaper.
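The choice described in this post can be written down as a small lookup. This is a sketch of the behaviour as described here, not HarfBuzz source code: the function and the returned labels are hypothetical, and it uses the registered OpenType tags 'deva' for the first Devanagari model and 'lana' for Tai Tham.

```python
def shaper_for(script_run, chosen_tag):
    """Model which shaping path a run takes, given the script tag
    that was found in the font (simplified, per the description above)."""
    if chosen_tag == "DFLT":
        # Font only offers default-script features: user-defined shaping.
        return "default"
    if chosen_tag.endswith("3"):
        # Tags such as 'dev3' are described as selecting USE.
        return "USE"
    if script_run == "Devanagari":
        return "Indic v2" if chosen_tag == "dev2" else "Indic v1"
    if script_run in ("Tai Tham", "Limbu"):
        return "USE"
    return "default"


for run, tag in [("Devanagari", "deva"), ("Devanagari", "dev2"),
                 ("Devanagari", "dev3"), ("Tai Tham", "lana"),
                 ("Tai Tham", "DFLT")]:
    print(run, tag, "->", shaper_for(run, tag))
```

Other engines may map the same tags differently, so the table above should be read as one renderer's policy rather than something a font can rely on everywhere.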

John Hudson · August 2022

RichardW said:

Actually, in many cases they do. For a Devanagari script run, HarfBuzz offers at least a choice of V1 Indic shaping engine (script tag "deva"), V2 Indic engine (script tag "dev2") and USE (script tag "dev3").

That’s still not technically specifying shapers at the font level: rather it is supporting different shaping models, and it is still up to the shaping engine to decide which to apply. HarfBuzz happens to do one thing; DirectWrite does another; CoreText and Adobe do something else.* The point of the different Devanagari tags, only two of which are standard, isn’t to specify what shaper to use, but to be able to support different shapers within the same font.

* Adobe is responsible for the DFLT script tag concept, but to my knowledge they never really documented what they do with it. Microsoft, as I recall, didn’t really see the point of it. HarfBuzz has implemented a use for it.

RichardW · August 2022

But if only one of "deva" and "dev2" (and no "dev3") is specified, doesn't that force the rendering engine to use the corresponding shaper if the engine supports it? Admittedly that makes the font more difficult to use, as one has to know what shaper capabilities to expect.

John Hudson · August 2022

But if only one of "deva" and "dev2" (and no "dev3") is specified, doesn't that force the rendering engine to use the corresponding shaper if the engine supports it?

Sure, but that just means you are restricting input to the system, not that the system is designed with the intent of you being able to tell it what to do.

Jongseong Park · May 2024

What is the latest word on how shaping is handled for Tai Tham on Windows 10 in NotePad and MS Word? I see that these applications aren't able to attach subscripts to the vowel sign AA, e.g. rendering ᩅᩥᩉᩣ᩠ᩁ like this:

Would this be because DirectWrite passes Tai Tham to USE?

The same text is displayed correctly in Chrome, Edge, and Firefox – would that be because they use HarfBuzz, and also Graphite in the case of Firefox? The font I'm testing is Payap Lanna from SIL, and I do notice that it renders differently according to the browser. Here is ᨡᩬ᩠ᨿ᩵ in Chrome:

Here it is displayed correctly in Firefox:

As Payap Lanna is published by SIL which also developed Graphite, I'm guessing it was designed to be rendered with the latter.

I recently made a video on the Tham Lanna script, the version of Tai Tham used in northern Thailand, and promised to follow up with another video discussing its use in computerized environments. I don't need to go into too much detail, but I plan to mention which shaping engines are used in these different applications and would like to get it right.
