Microsoft Word and OpenType Substitutions for non-Latin Sequences?

Daniel Yacob
Daniel Yacob Posts: 14
edited August 2023 in Font Technology

Greetings All,

I’ve gone down the rabbit hole of OpenType feature support in various applications and could use some help understanding the policies that Microsoft Word applies. It appears to be the odd application out, interpreting features differently from the others.

What I am observing is that substitutions (in calt, liga, or other features) appear to work for a Latin symbol sequence, but not for a non-Latin one.  Perhaps Word requires additional language configuration?  I reduced the problem to this minimal features.fea file (FontLab 8 is my design tool):

languagesystem DFLT dflt;  # the only line I added by hand
languagesystem latn dflt;  # added by FL8
languagesystem ethi dflt;  # added by FL8
languagesystem grek dflt;  # added by FL8

feature calt {
  sub a b by x;                   # works everywhere
  sub h a by uni1203;             # works everywhere
  sub uni1200 a by uni1203;       # fails everywhere
  sub h uniFE00 by uni1203;       # fails in MS Word, works for LibreOffice & Chrome
  sub uni1200 uniFE00 by uni1203; # fails in MS Word, works for LibreOffice & Chrome
} calt;

 

In summary, I see for MS Word the following is ok:

sub <latin> <latin> by <any>;

but not:

sub <latin> <non-latin> by … ;
sub <non-latin> <non-latin> by … ;

and universally failing is:

sub <non-latin> <latin> by … ;

I would like to understand why these fail and which missing OT statement MS Word needs.  To be fair, I haven’t tested with Cyrillic, Greek, etc., so “non-Latin” is limited to Ethiopic, variation sequences, and PUA symbols in my trials.

Any help is appreciated!

Thanks,

-Daniel

Comments

  • Shaping engines divide a text string into separate segments, and each segment can hold only one script. OpenType layout features are processed per segment, so a substitution whose input spans two scripts will never work.

    The shaping engine used by Microsoft Word is outdated, so not all features work as they should.
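
    The segmentation described above can be sketched in a few lines of Python. This is only a toy model: script_of is a hypothetical helper that knows just the Ethiopic block and ASCII letters, whereas a real shaper uses the full Unicode Script property and its own run rules.

```python
def script_of(ch):
    """Toy script lookup (an assumption for illustration, not a real shaper table)."""
    cp = ord(ch)
    if 0x1200 <= cp <= 0x137F:
        return "Ethi"
    if ch.isascii() and ch.isalpha():
        return "Latn"
    return "Zyyy"  # everything else is treated as Common here


def split_runs(text):
    """Split text into (script, substring) runs; a new run starts at every script change."""
    runs = []
    for ch in text:
        script = script_of(ch)
        if runs and runs[-1][0] == script:
            runs[-1] = (script, runs[-1][1] + ch)
        else:
            runs.append((script, ch))
    return runs


# "h a" stays in one Latin run, so a lookup can match both glyphs.
# "uni1200 a" is split into two runs, so no lookup ever sees both glyphs together.
for sample in ("ha", "\u1200a"):
    print(ascii(sample), [script for script, _ in split_runs(sample)])
```

    In this model the input of `sub uni1200 a by uni1203;` is never presented to the font as one sequence, which matches the "fails everywhere" case in the original post.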

  • Thank you @Erwin Denissen, that explains the scenario that "fails everywhere", which wasn't a true use case of mine.  The Unicode Variation Sequence code points should be treated as script-independent, though; this may be an area where the Word shaping engine is outdated and the others are more current.
  • John Hudson
    John Hudson Posts: 3,244
    Unicode variation sequences are not supposed to be handled via OTL GSUB. They are pre-glyph-processing substitutions made at the cmap level using a format 14 cmap subtable. I don’t know whether there is any way to generate a format 14 cmap subtable from within typical font development tools; I use DTL OTMaster to hand-code mine, like this:
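
    (The hand-coded subtable itself is not reproduced in this copy of the thread.) Separately from it, the idea of resolving a variation sequence at character-to-glyph mapping time, before any GSUB lookup runs, can be modelled roughly as below. This is not OTMaster syntax or the binary layout of a format 14 subtable; CMAP and UVS are hypothetical dictionaries that reuse the glyph names from the original post.

```python
# Hypothetical data for illustration only.
CMAP = {0x0068: "h", 0x1200: "uni1200"}   # ordinary code point -> glyph
UVS = {(0x1200, 0xFE00): "uni1203"}       # (base, variation selector) -> glyph


def is_variation_selector(cp):
    """True for VS1-VS16 (U+FE00..U+FE0F) and VS17-VS256 (U+E0100..U+E01EF)."""
    return 0xFE00 <= cp <= 0xFE0F or 0xE0100 <= cp <= 0xE01EF


def map_to_glyphs(text):
    """Map characters to glyph names, consuming base + selector pairs first."""
    glyphs = []
    i = 0
    while i < len(text):
        base = ord(text[i])
        nxt = ord(text[i + 1]) if i + 1 < len(text) else None
        if nxt is not None and is_variation_selector(nxt):
            # Listed sequence: use the variant glyph. Unlisted sequence: fall back
            # to the base glyph and drop the selector (a simplification).
            glyphs.append(UVS.get((base, nxt), CMAP.get(base, ".notdef")))
            i += 2
        else:
            glyphs.append(CMAP.get(base, ".notdef"))
            i += 1
    return glyphs


print(map_to_glyphs("\u1200\uFE00"))  # the listed sequence yields the variant glyph
print(map_to_glyphs("h\uFE00"))       # an unlisted sequence falls back to the base glyph
```

    Because the pair is consumed before shaping in this model, script segmentation and feature defaults never come into play.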


  • You can use FontCreator to add Unicode variation sequences.



  • Fontmake or ufo2ft will use it when generating the OpenType files. Other tools might do so as well.
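
    The item this comment refers to is not shown in this copy of the thread. One plausible candidate, and this is an assumption, is the UFO 3 lib.plist key public.unicodeVariationSequences, which maps a variation selector to base code points and glyph names. A minimal sketch of producing such an entry with Python's standard plistlib, using a hypothetical mapping based on the glyph names in the original post:

```python
import plistlib

# Hypothetical mapping: U+1200 followed by VS1 (U+FE00) selects glyph "uni1203".
# Keys are uppercase hexadecimal strings: selector first, then base code point.
uvs = {"FE00": {"1200": "uni1203"}}

# Only the one key is written here; a real lib.plist holds other keys too,
# so in practice the entry would be merged into the existing file.
lib = {"public.unicodeVariationSequences": uvs}

data = plistlib.dumps(lib)        # XML plist as bytes
print(data.decode("utf-8"))

# Reading it back gives the same nested dictionaries.
roundtrip = plistlib.loads(data)
```

    Whether a given build tool picks this key up should be checked against that tool's own documentation.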
  • Thanks @John Hudson, @Erwin Denissen, and @Denis Moyogo Jacquerye, the responses are eye-opening and fill in a critical knowledge gap on my end.  It's great to see that there are a number of ways to tackle the problem.  I look forward to trying them this week.  

    thanks again!
  • bdevos
    bdevos Posts: 5
    While OpenType shapers do divide up text by script, what script a character is has some complications. Some characters have script Common or Inherited, or are listed in ScriptExtensions, and can therefore be included in text segments with other scripts. Not all applications handle this extra complication.

    Word (on Windows) uses DirectWrite for text shaping. Notepad also uses DirectWrite, and for testing DirectWrite I would use both applications. I have encountered situations where Word would not apply liga by default (but could be enabled by the user), but would apply rlig by default. Notepad applied both with no user interaction.

    From my reading of the Variation Sequences FAQ, you need to use a variation sequence that is known to Unicode.
  • John Hudson
    John Hudson Posts: 3,244
    edited August 2023
    “Not all applications handle this extra complication.”

    And those that do may not handle it consistently. There is no formal specification for how to perform OTL script itemisation and run segmentation, and different software makers have implemented it without common agreement. So, for example, I have found different results for script=Common integration into runs in Microsoft and Adobe shapers.

    Really, this is something for which a standard algorithm is needed, one that would account even for edge cases such as adjacent sequences of different scripts with a script=Common character between them.
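
    To make that edge case concrete, here is a toy Python sketch in which a Common character sits between a Latin and an Ethiopic letter. The two policies ("previous" and "next") are invented for illustration and are not claimed to match what any Microsoft or Adobe shaper does; script_of is a hypothetical lookup covering only the characters used here.

```python
def script_of(ch):
    """Toy script lookup (an assumption, not real Unicode Script data)."""
    cp = ord(ch)
    if 0x1200 <= cp <= 0x137F:
        return "Ethi"
    if ch.isascii() and ch.isalpha():
        return "Latn"
    return "Zyyy"  # Common


def resolve_common(scripts, policy):
    """Give each Common character the script of a neighbour, chosen by policy."""
    resolved = list(scripts)
    for i, script in enumerate(scripts):
        if script != "Zyyy":
            continue
        before = next((s for s in reversed(scripts[:i]) if s != "Zyyy"), None)
        after = next((s for s in scripts[i + 1:] if s != "Zyyy"), None)
        first, second = (before, after) if policy == "previous" else (after, before)
        resolved[i] = first or second or "Zyyy"
    return resolved


def itemise(text, policy="previous"):
    """Return (script, substring) runs after resolving Common characters."""
    scripts = resolve_common([script_of(ch) for ch in text], policy)
    runs = []
    for ch, script in zip(text, scripts):
        if runs and runs[-1][0] == script:
            runs[-1] = (script, runs[-1][1] + ch)
        else:
            runs.append((script, ch))
    return runs


text = "a-\u1200"  # Latin letter, Common hyphen, Ethiopic letter
for policy in ("previous", "next"):
    print(policy, [(s, ascii(t)) for s, t in itemise(text, policy)])
```

    The same three characters end up with the hyphen in the Latin run under one policy and in the Ethiopic run under the other, so a lookup involving the hyphen could match in one implementation and not in another.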
  • I greatly appreciate all the help and insight today.  I gave it a try with fontmake and FontCreator, skipping DTL OTMaster for now since I didn't see a trial version.

    I added a couple of VS entries to the UFO lib.plist, then built and installed the font.  The VSs were accepted by MS Word!  Unfortunately, FL8 was not including color data in its UFO export, so that became a new obstacle (I reported the issue to tech support; this may simply be an export limitation).

    With FontCreator, I could open the COLR OTF file and add the VS mappings, and this finally worked as desired in Word.  I'll test thoroughly during the week; at the moment it appears that I have a viable workflow.

    thanks, again!
  • bobh
    bobh Posts: 1
    As @bdevos notes, Unicode says, in answer to the question "What variation sequences are valid?": only those listed in StandardizedVariants.txt, emoji-variation-sequences.txt, or the registered sequences listed in the Ideographic Variation Database (IVD).

    I cannot find either "h uniFE00" or "uni1200 uniFE00" from the original question in any of the above references, though maybe I missed them. Are these the sequences you found were accepted by MS Word?
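
    A check like this can be scripted. The sketch below parses lines in the layout used by StandardizedVariants.txt (code points, a semicolon, a description, then a comment) and tests whether a base + selector pair is present. SAMPLE is an abbreviated stand-in and should be replaced by the contents of the published file; the two entries are believed to match real ones but should be verified against it.

```python
SAMPLE = """\
# Abbreviated stand-in for StandardizedVariants.txt
0030 FE00; short diagonal stroke form; # DIGIT ZERO
2229 FE00; with serifs; # INTERSECTION
"""


def parse_sequences(text):
    """Collect (base, selector) code point pairs from StandardizedVariants-style lines."""
    sequences = set()
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()   # drop comments and blank lines
        if not line:
            continue
        first_field = line.split(";")[0]
        base, selector = (int(cp, 16) for cp in first_field.split())
        sequences.add((base, selector))
    return sequences


listed = parse_sequences(SAMPLE)
for base, selector in [(0x0030, 0xFE00), (0x0068, 0xFE00), (0x1200, 0xFE00)]:
    status = "listed" if (base, selector) in listed else "not in this sample"
    print(f"U+{base:04X} U+{selector:04X}: {status}")
```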

    Also, as for 
    sub <non-latin> <non-latin> by … ;
    not seeming to work, I can confirm that in Word 2016 on Windows 10 such sequences do work for Arabic.
  • Thomas Phinney
    Thomas Phinney Posts: 2,901
    edited August 2023
    I think you need to be more specific about "non-Latin".
    Are they the same non-Latin writing system?

    e.g.
    sub "non-latin-A" "non-latin-B" by ...
    is expected to fail, but

    sub "non-latin-A" "non-latin-A" by ...

    should work
  • @bobh, those two sequences failed in MS Word.  The 2nd case I did get accepted by Word when I moved the substitution into a cmap table.

    Regarding the observation by @bdevos, I found that Word and other apps, and the font rendering stack, do not enforce the sequences defined in StandardizedVariants.txt and the related files as the only permitted set.  Fortunately, I was able to add my own custom Variation Sequences (in the cmap).  I interpreted those files as more of a reference for font vendors who want to support variation sequences.

    @Thomas Phinney, it was the 1st, mixed-script case that was failing. I hadn't tried the 2nd case; it seems more sensible that it should work.
  • John Hudson
    John Hudson Posts: 3,244
    “those two sequences failed in MS Word.  The 2nd case I did get accepted by Word when I moved the substitution into a cmap table.”
    I note that you were experimenting with these lookups in the calt feature, which would depend on the shaping engine applying that feature by default in Word. A more broadly reliable and better-suited feature would be ccmp. However, the cmap 14 subtable mechanism is definitely the better bet for variation selector sequences, since that is what it was specified for, and it bypasses shaping engine dependencies.

  • Can someone mention some fonts containing cmap format 14 tables, so that people can see them in action? I'd like to see both default and non-default UVS tables, if possible.
  • Here are some in the Google Fonts catalogue:

    ofl/mplus1p
    ofl/notoserifsc
    ofl/notosanstc
    ofl/notoserifhk
    ofl/padauk
    ofl/notosanshk
    ofl/notoseriftc
    ofl/notosanssc
    ofl/bizudgothic
    ofl/bizudpgothic
    ofl/bizudpmincho
    ofl/bizudmincho
    ofl/notosansjp
    ofl/notoserifkr
    ofl/notosanskr
    ofl/notoserifjp
    ofl/stixtwomath
  • Thanks, Simon! These should keep me out of trouble for a while.