How to prevent Latin base letter and mark combinations from falling back to legacy codepoints?

aminabedi68 · March 2022

Hello

I'm developing a new version of a typeface with broad Latin character support, implemented in two ways:

Legacy encoded codepoints (e.g. U+00E0 = à)

The OpenType mark feature, which positions combining marks on base letters (e.g. U+0061 = a followed by U+0300 = gravecomb)

I started testing both approaches in browsers and apps, and realized that base letter and mark sequences fall back to the legacy codepoints! The font has no problems that I know of, but I can't tell whether everything is actually correct in my tests. I'm looking for a CSS property or browser flag to disable the fallback. Have you ever faced this problem? What was your solution?

John Hudson · March 2022

There are a couple of different levels at which this can occur:

1. Unicode NFC normalisation, which maps from decomposed sequences to precomposed characters at the text level (NFD normalisation maps the other way). Normalisation may be applied to the stored text or in a buffered layer between the stored text and the displayed text, e.g. as part of sorting or searching operations.

2. Text layout engine font interaction, mapping from decomposed to precomposed when selecting which glyph(s) to paint at the font cmap level. I think this was originally implemented in layout engines in the interests of downstream processing efficiency, and to avoid broken diacritics in the days before reliable mark-to-base positioning, but it means that font makers are burdened with the legacy requirement to include precomposed glyphs for those diacritics that have Unicode encodings.
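To make the first level concrete, here is a minimal JavaScript sketch of text-level normalisation using the standard String.prototype.normalize method. The helper name codePoints is only for illustration. It shows what happens to the stored text and says nothing about what a layout engine does at the cmap level.

```javascript
// List the code points of a string as U+XXXX labels (illustrative helper).
function codePoints(text) {
  return [...text].map(
    (ch) => 'U+' + ch.codePointAt(0).toString(16).toUpperCase().padStart(4, '0')
  );
}

const decomposed = 'a\u0300'; // a followed by combining grave
const precomposed = '\u00E0'; // precomposed a-grave

// NFC maps the decomposed sequence to the precomposed character.
const nfc = decomposed.normalize('NFC');
// NFD maps the precomposed character back to base + mark.
const nfd = precomposed.normalize('NFD');

console.log(codePoints(nfc).join(' ')); // U+00E0
console.log(codePoints(nfd).join(' ')); // U+0061 U+0300
```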

There’s basically nothing that font makers can do about this. Even if text is normalised to decomposed strings using NFD, layout engines may still map to the precomposed glyphs at the cmap level. If you want your fonts to always display the decomposed glyph sequences, then you actually have to decompose them at the glyph level in the font ccmp feature, e.g.

sub agrave by a gravecomb;
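If a font has many precomposed glyphs, writing these decompositions by hand gets tedious. The following is a rough JavaScript sketch that derives them from Unicode NFD and prints one-to-many substitution lines in feature-file style. The function names uniName and ccmpRule are hypothetical, and the uniXXXX glyph names are an assumption: a real font may well use names like agrave and gravecomb, so the output would need mapping to the font's own glyph names.

```javascript
// Glyph name in the uniXXXX convention.
// Assumption: BMP code points only, and that the font uses these names.
function uniName(ch) {
  return 'uni' + ch.codePointAt(0).toString(16).toUpperCase().padStart(4, '0');
}

// Build a one-to-many substitution line for a precomposed character,
// or return null when NFD does not split it into several code points.
function ccmpRule(precomposed) {
  const parts = [...precomposed.normalize('NFD')];
  if (parts.length < 2) return null;
  return `sub ${uniName(precomposed)} by ${parts.map(uniName).join(' ')};`;
}

// a-grave, a-ogonek, and plain "a" (which yields no rule).
const rules = ['\u00E0', '\u0105', 'a'].map(ccmpRule).filter(Boolean);
console.log(rules.join('\n'));
```

The resulting lines would go inside the font's ccmp feature block.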

Thomas Phinney · March 2022

Hmmm. Well, yes, if Unicode had developed in a different order and without backwards compatibility guarantees, U+00E0 as a precombined à would not exist. I am a bit reluctant to call it a “legacy” codepoint, but I understand why you may think of it that way. But...

AFAIK there isn’t a way to disable that. Which is to say, many layout engines choose to combine mark and base, if the combined thing is supported in the font. Unicode says they are canonically equivalent, and the layout engine might reasonably think the layout should be at least as good and possibly better, so….

I will share what I have done, even though I imagine there are some better solutions.

One thing you can do instead is test your base characters and anchors via combinations that do NOT in fact have precombined codepoints.

That is, test the anchoring point on the “a” by testing it in combination with diacritic characters with which “a” has no precombined form supported in your font.

Similarly, test the anchor point on the acute diacritic by testing it in combination with letters with which it has no precombined form supported in your font.
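One way to find such test combinations is sketched below in JavaScript: it keeps the base and mark pairs that Unicode NFC does not compose. The function name staysDecomposed and the sample bases and marks are only illustrative. Note the assumption: this checks for precomposed characters in Unicode, not in your font, so the pairs it returns should be a safe subset of the ones described above.

```javascript
// True when NFC leaves base + mark as two code points,
// i.e. Unicode has no precomposed character for the pair.
function staysDecomposed(base, mark) {
  return [...(base + mark).normalize('NFC')].length === 2;
}

const bases = ['a', 'o'];
const marks = ['\u0300', '\u030B', '\u0328']; // grave, double acute, ogonek

// Collect every base + mark pair that cannot be recomposed.
const testPairs = [];
for (const base of bases) {
  for (const mark of marks) {
    if (staysDecomposed(base, mark)) {
      testPairs.push(base + mark);
    }
  }
}

// Print each surviving pair as a list of code points.
for (const pair of testPairs) {
  const labels = [...pair].map(
    (ch) => 'U+' + ch.codePointAt(0).toString(16).toUpperCase().padStart(4, '0')
  );
  console.log(labels.join(' + '));
}
```

With these sample inputs, "a" plus the double acute is the kind of pair that survives, so it can be used to check the top anchor on "a".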

aminabedi68 · March 2022

(Sorry for the delay.)

Yep! It seems there is no way to turn it off.

I agree, decomposed/precomposed are better terms.

And yes, ccmp can make it possible to see the combination of decomposed glyphs (I saw that in another project), but then you can't access the precomposed glyphs (though maybe turning off ccmp would bring them back...). Testing inside font creation software is another option, but it takes a long time that way.

Thank you.

Adam Jagosz · March 2022

You can prevent the re-composition by inserting a zero-width non-joiner. E.g. with this bit of JavaScript:

// Decompose, then put a zero-width non-joiner (U+200C) between the code points.
const x = [...'Ấ'.normalize('NFD')].join('\u200C');

But mark combining seems to only work this way in Firefox. Screenshot (Alegreya Sans):

Image: https://us.v-cdn.net/5019405/uploads/editor/0c/no5jdwvwy2ox.png
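To check what that string actually contains, here is a small JavaScript sketch that lists its code points and confirms that NFC normalisation leaves it alone. The variable names are only for illustration, and this covers the text level only; whether the marks are then positioned on the base is up to the layout engine.

```javascript
// U+1EA4 is A with circumflex and acute; NFD splits it into three code points.
const decomposed = '\u1EA4'.normalize('NFD');
// Put a zero-width non-joiner (U+200C) between each code point.
const separated = [...decomposed].join('\u200C');

const labels = [...separated].map(
  (ch) => 'U+' + ch.codePointAt(0).toString(16).toUpperCase().padStart(4, '0')
);
console.log(labels.join(' '));

// The ZWNJ sits between the base and the marks, so NFC cannot recompose them.
const unchanged = separated.normalize('NFC') === separated;
console.log(unchanged);
```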

aminabedi68 · March 2022

Wow! It's working!

(The first one is the precomposed glyph Ą, and the second is the decomposed glyphs A and ogonek.)

Image: https://us.v-cdn.net/5019405/uploads/editor/ta/v273x7978fj3.png

Thank you so much.
