Mapping a Unicode range to another

Benjamin Steiner · July 2017

Hi everyone,

I am not a font professional. I am using an existing font and trying to map one Unicode range to another (a Syriac font, with the Hebrew Unicode points mapped to Syriac glyphs).

In FontForge, for each Hebrew alphabet Unicode point, I add an "Alternate Unicode Encoding" in Glyph Info > Unicode and the previously empty Unicode points now display the correct glyphs.

The problem is, the Substitutions are not applied, i.e. ligatures between adjacent glyphs etc. They are correctly applied when the original Unicode points of the glyphs are entered, but not the new "alternate encoding" (Hebrew) Unicode points.

Am I doing something wrong? I thought these Substitutions depended on glyphs, not Unicode points, so not sure why it doesn't work.

Thanks.

Benjamin

John Hudson · July 2017

To confirm I understand, you are mapping Syriac glyphs to Hebrew Unicode codepoints in the font cmap table? I can guess why you might be doing this, but it isn't really a good idea.

The first step of OpenType Layout processing is for software to itemise runs of text according to Unicode script property of the characters, and then to pass the itemised runs to the appropriate shaping engine. So when you have a sequence of Hebrew Unicode characters, these are going to be identified as Hebrew by the software, and passed to the Hebrew shaping engine. This means that any OpenType GSUB and GPOS lookups you want to be applied to the glyphs mapped from those characters a) need to be associated with the <hebr> script tag, and b) need to be associated with layout features that are processed by the Hebrew shaping engine.
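A rough illustration of that itemisation step, as a Python sketch. It is not how any real engine is written: it approximates the Unicode script property with two hard-coded block ranges, and the function names `script_of` and `itemise` are invented for the example.

```python
# Toy script itemiser: splits a string into runs that share a script tag.
# Assumption: script is approximated by Unicode block ranges. Real engines
# use the Unicode Script property and resolve Common/Inherited characters.
SCRIPT_RANGES = [
    (0x0590, 0x05FF, "hebr"),  # Hebrew block
    (0x0700, 0x074F, "syrc"),  # Syriac block
]


def script_of(ch):
    """Return an OpenType-style script tag for one character."""
    cp = ord(ch)
    for lo, hi, tag in SCRIPT_RANGES:
        if lo <= cp <= hi:
            return tag
    return "DFLT"


def itemise(text):
    """Group consecutive characters of the same script into (tag, run) pairs."""
    runs = []
    for ch in text:
        tag = script_of(ch)
        if runs and runs[-1][0] == tag:
            runs[-1] = (tag, runs[-1][1] + ch)
        else:
            runs.append((tag, ch))
    return runs


# Hebrew alef + bet followed by Syriac alaph + beth gives two separate runs.
runs = itemise("\u05D0\u05D1\u0710\u0712")
```

The point of the sketch is that the run tag is decided from the characters alone, before the font's glyphs are consulted at all.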

The biggest difference between Hebrew and Syriac shaping engines is that the latter does joining property analysis on the text and applies associated shaping features <init> <medi> <fina> etc. The Hebrew shaping engine does not do this kind of analysis and does not apply those features, because Hebrew is not treated as a joining script by Unicode.

So it is difficult to get Syriac shaping to happen on Hebrew characters.

Really, this sort of thing should be happening at the text processing level, not the font level. i.e. if you want to be able to display Hebrew text in Syriac, you should actually convert the text to Syriac characters using a macro of some kind.
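A minimal sketch of such a text-level conversion in Python. The first two pairs are the ones that come up in this thread (alef to alaph, bet to beth); gimel to gamal and dalet to dalath are added here only as illustration, and the table is deliberately incomplete. A real conversion would also need decisions about Hebrew final forms and points, which this does not attempt.

```python
# Text-level conversion: replace Hebrew characters with Syriac characters
# before the text ever reaches the layout engine.
# Assumption: a partial, illustrative mapping; extend it for real use.
HEBREW_TO_SYRIAC = {
    "\u05D0": "\u0710",  # HEBREW LETTER ALEF  -> SYRIAC LETTER ALAPH
    "\u05D1": "\u0712",  # HEBREW LETTER BET   -> SYRIAC LETTER BETH
    "\u05D2": "\u0713",  # HEBREW LETTER GIMEL -> SYRIAC LETTER GAMAL
    "\u05D3": "\u0715",  # HEBREW LETTER DALET -> SYRIAC LETTER DALATH
}

# str.translate wants a table keyed by code point.
_TABLE = str.maketrans(HEBREW_TO_SYRIAC)


def hebrew_to_syriac(text):
    """Return text with mapped Hebrew letters replaced; other characters pass through."""
    return text.translate(_TABLE)


converted = hebrew_to_syriac("\u05D0\u05D1")
```

Because the result is encoded as Syriac characters, it would be itemised as Syriac and handed to the engine that does joining analysis, with an unmodified Syriac font.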

Michel Boyer · July 2017

John

Why is it not possible to make the substitution at the font level, for instance as a stylistic alternate, and then let the shaping engine do its job with the resulting Syriac characters? I just tried and that did not work.

Benjamin Steiner · July 2017

Thanks a lot, John.

I read up on 'init', 'medi' and 'fina' tags and ArabicShaping.txt's definition of which characters have Dual_Joining (hence 'medi' is applied) etc.

But I am having the same thoughts as Michel's: doesn't OpenType support some kind of pre-transform function of Unicode points?

This all happens inside one font: after they leave the text processing application, the sequence of Hebrew Unicode points would be pre-transformed into Syriac, and this sequence of Syriac Unicode points would then be treated as the normal input to the font.

Khaled Hosny · July 2017

There is no such transformation in fonts. If the underlying text is Hebrew then it will be processed as Hebrew, no matter what glyphs you put in there.

John Hudson · July 2017

But I am having the same thoughts as Michel's: doesn't Opentype support some kind of pre-transform function of Unicode points?

No. OpenType works in glyph space. Text is encoded in character space. There are two interfaces between character space and glyph space: the font cmap table, which maps characters to their default glyphs; and the shaping engine, which activates some of the glyph features in the font based on analysis of the text string. There is no mechanism by which you can encode text as Hebrew and tell it to behave like Syriac, because Syriac shaping behaviour is based on the text being encoded as Syriac characters. If you want Syriac shaping, you have to provide Syriac characters to the shaping engine; if you provide Hebrew characters, you're going to get Hebrew shaping.

A glyph is just an index in a font. The shaping engine has no idea what the glyph looks like or whether its shape is Hebrew or Syriac. The shaping engine is entirely dependent on the Unicode script property of the character in the text, and the mapping of the character code to a glyph index in the font cmap table. What comes after (the OpenType Layout features substituting and positioning glyphs) follows from what the shaping engine understands the text to be, and the path from the character code to the glyph index and through the layout features. But that path always begins with script itemisation, and once the shaping engine determines that the characters are Hebrew, that is going to determine how the runs are shaped, and there's nothing you can do in the font to tell the engine 'No, these are really Syriac!'

Simon Cozens · July 2017

OpenType works in glyph space apart from the ccmp feature, which alters the character to glyph mapping. I don't know if that happens after the shaper has made inferences about the text (I suppose technically it shouldn't) and it certainly isn't meant for this kind of thing but it might be fun to try.

Khaled Hosny · July 2017

ccmp has no effect on character to glyph mapping, it works in glyph space like any other feature (the only special thing about it is that shaping engines should apply it early on in the process).

Khaled Hosny · July 2017

BTW, this kind of thing can be done with Graphite (and probably AAT) fonts since all the shaping logic is in the font and the shaping engine is not script-aware at all, but it is still not something you should be doing at font level.

Simon Cozens · July 2017

Khaled Hosny said:

ccmp has no effect on character to glyph mapping, it works in glyph space like any other feature

I'm sure you're right, but that means the OpenType Spec is very misleading:

Tag: “ccmp”
...Function: To minimize the number of glyph alternates, it is sometimes desired to decompose a character into two glyphs. Additionally, it may be preferable to compose two characters into a single glyph for better glyph processing. This feature permits such composition/decompostion[sic]...
Recommended implementation: The ccmp table maps the character sequence to its corresponding ligature (GSUB lookup type 4) or string of glyphs (GSUB lookup type 2)...

Khaled Hosny · July 2017

I think whoever wrote that description was really confused or had a pre-OpenType definition of character (which was synonymous with glyph).

Michel Boyer · July 2017

I see from Adobe's Opentype Layout Engine specs that

1. All glyphs in the client's glyph run must belong to the same language system (Glyph sequence matching may not occur across language systems.)

Without that restriction, I don't see why declaring

lookup ss01lookup {
  lookupflag 0;
    sub \u05D0 by \U0710 ;    # ALEF -> SYRIAC LETTER ALAPH
    sub \u05D1 by \U0712 ;    # BET -> SYRIAC LETTER BETH
    ...
} ss01lookup;

feature ss01 {
  script DFLT;
    language dflt ;
      lookup ss01lookup;
  script hebr;
    language dflt ;
      lookup ss01lookup;
  script syrc;
    language dflt ;
      lookup ss01lookup;
} ss01;

and then replacing syrc by hebr (or adding hebr) in all the feature declarations would not give a working font unless it is somewhere else specified that the features specific to arabic, syriac etc may not be given the script tag hebr.

Khaled Hosny · July 2017

This won’t work since the engine bases its decision whether or not to do Syriac-specific processing on the characters, not the glyphs. The glyphs can be whatever they want; if the characters are Hebrew then this is Hebrew for all the engine knows.

André G. Isaak · July 2017

lookup ss01lookup {
  lookupflag 0;
    sub \u05D0 by \U0710 ;    # ALEF -> SYRIAC LETTER ALAPH
    sub \u05D1 by \U0712 ;    # BET -> SYRIAC LETTER BETH
    ...
} ss01lookup;

I think that the source of confusion here may reside in the (pseudo) syntax used above, where you appear to be identifying glyphs by the unicode values of their associated base characters.

GSUB tables deal exclusively with glyph IDs, not with unicode values, so even if you write a substitution which *appears* to change the underlying character, it really does no such thing -- it simply replaces one GID with another leaving the underlying character (and hence unicode value) unchanged.

As an example, consider the following (rather pointless) feature:

feature ss01 { # ROT-13
  sub [A B C D E F G H I J K L M N O P Q R S T U V W X Y Z] by
      [N O P Q R S T U V W X Y Z A B C D E F G H I J K L M];
} ss01;

This would implement ROT-13 within a font and applying this feature would result in text which looks like gibberish.

So, for example, "THE QUICK BROWN FOX" would be rendered as "GUR DHVPX OEBJA SBK".

However, if you were to apply this feature and then run your spell checker, it wouldn't find any errors because the application would still see this as 'THE QUICK BROWN FOX'. Similarly, in your example above, you can map alef to alaph, but anything outside the font (including the shaping engine) is still going to see this as alef (U+05D0). All of the substitutions performed by your GSUB table take place after the shaping engine has already done its work.
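The same point as a toy model in Python. Everything here is made up for illustration: the glyph IDs, the `CMAP` and `SS01` tables, and the `to_glyphs` helper stand in for a font's cmap table and a single-substitution GSUB lookup, and are not any real library's API.

```python
# Toy model of character space vs glyph space. All glyph IDs are invented.
CMAP = {
    "\u05D0": 10,  # HEBREW LETTER ALEF  -> default glyph 10
    "\u05D1": 11,  # HEBREW LETTER BET   -> default glyph 11
    "\u0710": 20,  # SYRIAC LETTER ALAPH -> default glyph 20
    "\u0712": 21,  # SYRIAC LETTER BETH  -> default glyph 21
}

# A single-substitution lookup maps glyph IDs to glyph IDs, nothing else.
SS01 = {10: 20, 11: 21}


def to_glyphs(text, features=()):
    """Map characters to default glyphs, then apply ss01 if requested."""
    gids = [CMAP[ch] for ch in text]
    if "ss01" in features:
        gids = [SS01.get(gid, gid) for gid in gids]
    return gids


text = "\u05D0\u05D1"                          # Hebrew alef, bet
glyphs = to_glyphs(text, features=("ss01",))   # Syriac-looking glyph IDs
```

After the substitution the glyph list matches what Syriac text would produce, but `text` is untouched: anything that inspects the characters still sees U+05D0 and U+05D1.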

André

Michel Boyer · July 2017

André G. Isaak said:
lookup ss01lookup {
  lookupflag 0;
    sub \u05D0 by \U0710 ;    # ALEF -> SYRIAC LETTER ALAPH
    sub \u05D1 by \U0712 ;    # BET -> SYRIAC LETTER BETH
    ...
} ss01lookup;
I think that the source of confusion here may reside in the (pseudo) syntax used above, where you appear to be identifying glyphs by the unicode values of their associated base characters.

The syntax is the one used by FontForge for Adobe feature files and was read by FontForge. The font I tried this with had those glyph names. The names U07xx were in the original font and I added the names u05xx etc. for Hebrew (without any associated glyph). The substitution was properly applied so that the Hebrew text was displaying Syriac glyphs, but the Syriac features were not applied, which is consistent with rule 1 I cited above.

When I have time, I will make another experiment: put all the Syriac characters in the Hebrew range (that's a big cheat, which may also require renaming derived glyphs), replace the tag syrc by hebr in the feature definitions and see if the converted "Syriac" features are applied on that "Hebrew" glyph run (to use Adobe's terminology).

PS. I expect that applying ttx, then a sed script on the resulting ttx file and finally applying again ttx should be enough to get the desired font.

André G. Isaak · July 2017

When I have time, I will make another experiment: put all the Syriac characters in the Hebrew range (that's a big cheat, which may also require renaming derived glyphs), replace the tag syrc by hebr in the feature definitions and see if the converted "Syriac" features are applied on that "Hebrew" glyph run (to use Adobe's terminology).

This won't work -- the Hebrew shaping engine doesn't know anything about the cursive properties of Syriac, and as I pointed out in my previous post, no changes made by your features are going to affect the fact that the underlying text is Hebrew, not Syriac.

I think the only way you'd be able to get this to work would be to define some sort of 'calt' feature which basically does all the work normally done by the Syriac shaping engine (i.e. 'calt' would have to be used in place of 'init', 'medi', and 'fina'). As others have pointed out, though, this is probably not the best approach.

André

Michel Boyer · July 2017

André G. Isaak said:

the hebrew shaping engine doesn't know anything about the cursive properties of Syriac,

You may well be right (and most probably so). What I don't see is how that is a consequence of the Adobe spec. cited above. Where is the specification that implies that?

André G. Isaak · July 2017

The specification you're citing deals only with the feature file syntax. For the complete specifications, you'll want to look here:

https://www.microsoft.com/en-us/Typography/OpenTypeSpecification.aspx

André G. Isaak · July 2017

Just to add to the above, you might want to compare the following two documents:

https://www.microsoft.com/typography/otfntdev/arabicot/shaping.htm

https://www.microsoft.com/typography/otfntdev/hebrewot/shaping.htm

The first describes the Arabic shaping engine (which is also used for Syriac), whereas the second describes the Hebrew shaping engine. The crucial point here is that the Hebrew shaping engine doesn't call 'init', 'medi' and 'fina' for you, whereas the Syriac one does, and is aware of which Syriac characters can join and which can't.

Andre

Michel Boyer · July 2017

André

Those links describe the Uniscribe shaping engine. Is that considered a spec with which all applications on all platforms need to comply?

André G. Isaak · July 2017

Only applications which use uniscribe...

But similar principles are going to hold on other platforms such as DirectWrite, HarfBuzz, or ATS. If the input characters are Hebrew, whatever shaping engine is used is going to treat them as Hebrew, which means it isn't going to call on the relevant joining features in the font.

I should note that I'm actually a Mac person, not a PC person. I realized after posting those links that Uniscribe is dated, but I don't know the relevant DirectWrite links.

Andre

Michel Boyer · July 2017

I just did the experiment I described above and, if I select manually the proper features in FontForge, this time, (my old version of) FontForge applies them.

Image: https://us.v-cdn.net/5019405/uploads/editor/va/atzpzups5a3d.png

XeLaTeX does not and I can't guess what other application would.

John Hudson · July 2017

What you can do manually applying features in the preview panel of a font tool is going to differ considerably from what happens to actual text strings in applications. The font tool enables you to look at the output from the raw lookups, but those lookups are only going to be processed in software if the layout engine follows a path from the Unicode characters in the text to those lookups. If the lookups in question are associated with joining property features such as <init> <medi> and <fina>, those lookups are only going to be processed if the characters in the text string are characters that have joining properties according to Unicode's ArabicShaping.txt. Hebrew characters don't have joining properties, so those features are never going to be applied to Hebrew text.
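A very simplified sketch of that joining analysis in Python, to show why Hebrew input never reaches <init>, <medi> or <fina>. The joining types for the three Syriac letters are meant to follow ArabicShaping.txt, but the table is a tiny hand-picked subset, the function name `joining_features` is invented, and the logic ignores transparent marks and the special Syriac alaph forms that real engines handle.

```python
# Simplified joining analysis. Assumption: a three-letter excerpt of the
# joining types; characters not listed (including all Hebrew letters) are
# treated as having no joining behaviour and get no joining feature.
JOINING_TYPE = {
    "\u0712": "D",  # SYRIAC LETTER BETH, Dual_Joining
    "\u0713": "D",  # SYRIAC LETTER GAMAL, Dual_Joining
    "\u0715": "R",  # SYRIAC LETTER DALATH, Right_Joining
}


def joining_features(text):
    """Return one feature tag per character, or None where no joining applies."""
    types = [JOINING_TYPE.get(ch) for ch in text]
    result = []
    for i, jt in enumerate(types):
        if jt is None:
            result.append(None)  # e.g. Hebrew: no joining feature selected
            continue
        prev_t = types[i - 1] if i > 0 else None
        next_t = types[i + 1] if i + 1 < len(types) else None
        # A letter connects backwards only if the previous letter is dual-joining.
        joins_prev = prev_t == "D"
        # Only dual-joining letters connect forwards, and only to a joining letter.
        joins_next = jt == "D" and next_t in ("D", "R")
        if joins_prev and joins_next:
            result.append("medi")
        elif joins_prev:
            result.append("fina")
        elif joins_next:
            result.append("init")
        else:
            result.append("isol")
    return result


syriac = joining_features("\u0712\u0713\u0715")   # beth, gamal, dalath
hebrew = joining_features("\u05D1\u05D2\u05D3")   # bet, gimel, dalet
```

In this sketch the Syriac string gets one joining feature per letter, while the Hebrew string gets none, whatever glyphs the font has mapped to those characters.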

Those links describe the Uniscribe shaping engine. Is that considered a spec with which all applications on all platforms need to comply?

Pretty much. When we find inconsistencies between other shaping engine behaviour and Uniscribe, we report it as a bug, and generally the developers acknowledge it as such. Microsoft led the way on complex script shaping for OpenType, so defined the standard.

John Hudson · July 2017

@Simon Cozens I think Paul Nelson, who I am pretty sure wrote the <ccmp> feature description, used 'character' as shorthand for 'the default glyph mapped to the character in the cmap table'. The point is that <ccmp> is processed very early in layout, so the input is expected to be default glyph IDs from the cmap table (but possibly output from a preceding <locl> feature). But yes, as noted, <ccmp> is a GSUB feature like any other, working entirely in glyph space.

John Hudson · July 2017

[My IUC39 presentation Beyond Shaping provides an overview of the steps involved in text processing and display with OpenType, which might be helpful in making sense of the issues raised in this thread. Yes, there are multiple shaping/layout engines at play — with some inconsistencies as noted in the presentation —, but they all follow basically the same overall model.]

Michel Boyer · July 2017

John Hudson said:

What you can do manually applying features in the preview panel of a font tool is going to differ considerably from what happens to actual text strings in applications. The font tool enables you to look at the output from the raw lookups, but those lookups are only going to be processed in software if the layout engine follows a path from the Unicode characters in the text to those lookups. If the lookups in question are associated with joining property features such as <init> <medi> and <fina>, those lookups are only going to be processed if the characters in the text string are characters that have joining properties according to Unicode's ArabicShaping.txt. Hebrew characters don't have joining properties, so those features are never going to be applied to Hebrew text.

Given that FontForge applies the features that are selected manually, I expected that XeLaTeX might do it since it is also possible in XeLaTeX to list features to be activated for each font.

John Hudson · July 2017

Presumably, at some level XeLaTeX is still using shaping engine support, so you would actually have to be overriding that support in order to apply joining property features manually to individual characters in Hebrew text strings. Such a thing might be possible with CSS font feature tags in browser text, for instance. In effect, you would manually be assuming the role of a shaping engine: analysing the text and determining to which characters to apply which joining property features. But I suspect all the way along software is going to be fighting against you, trying to do script shaping based on script properties in the text string.

The whole OpenType Layout model is predicated on layout engines looking after shaping intelligence above the font level, with the font supporting that process (contra Apple's AAT and SIL's Graphite models, in which the shaping intelligence is built into the fonts).

Benjamin Steiner · July 2017

Thanks for this discussion, and John thanks for the detailed explanation.

In fact yes I've tried adding Hebrew to the Substitutions' scripts in FontForge and when the glyphs come from Hebrew characters, they're still not applied:

Image: https://us.v-cdn.net/5019405/uploads/editor/ew/78s6elmzpudh.png

Overall I need to type Syriac, but in Word for Mac this does not work well -- Word apparently does not understand that Syriac is a right-to-left language.

Made a post on the microsoft forum for this to be solved.

André G. Isaak · July 2017

I'm not clear on what the purpose of hebr{IWR ,SYR } in the above is, unless there is some tradition of writing syriac using hebrew characters of which I am not aware.

André G. Isaak · July 2017

Also, your problems with Word for Mac don't necessarily indicate that anything is wrong with your font. While Windows has supported Syriac for quite some time, Mac OS has yet to provide support AFAIK.

Michel Boyer · July 2017

According to this link, the culprit is not OS X.

I can't read or write Syriac but I opened with Pages the .doc file provided by the link above, tried a few copy and paste and saw no bad word reordering. Pages says that the fonts used by the word document are missing and uses a default. The missing fonts are Talada and Adiabene.
