Arabic Unicode normalisation

Titus Nemeth · December 2018

Most Arabic fonts follow the Unicode normalisation of the 'Allah' ligature FDF2 into 0627 0644 0644 0647. However, the glyph generally also shows two additional, separately encoded non-spacing marks, 0651 and 0670. Typically this normalisation is echoed in the OT code, which produces the fully vocalised 'Allah' ligature glyph even if the underlying text does not contain the displayed marks.

Image: https://us.v-cdn.net/5019405/uploads/editor/nz/yfyglhuxlad9.png

For the sake of consistency, I tend to think that it would be better to include 0651 and 0670 in the OT code that calls the ligature, but this would contradict Unicode normalisation. Users can get the desired ligature either by keying the correct string or through the glyph palette. Because 0670 is not accessible on the standard keyboard, one might also consider a hack in which a GSUB replacement allows 0654 to be keyed instead of 0670 to yield the FDF2 ligature.

Any views?

John Hudson · December 2018

For the sake of consistency, I tend to think that it would be better to include 0651 and 0670 in the OT code that calls the ligature, but this would be in contradiction with Unicode normalisation.

A couple of things to note:

1. Fonts are not responsible for implementing any Unicode normalisation, and glyph processing isn't bound by normalisation properties and behaviours. Normalisation is a character-level operation.

2. U+FDF2 has a compatibility decomposition to the sequence <0627 0644 0644 0647>; this means that U+FDF2 can be decomposed to that sequence, but that sequence is not to be composed to U+FDF2 in normalisation.

This compatibility decomposition is erroneous for the reason you note: it ignores the presence of the marks, which should be part of the decomposed sequence.
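The one-way nature of this decomposition can be checked with Python's standard `unicodedata` module. This is only a small sketch for illustration; the variable names are mine, and the results depend on the Unicode data shipped with the Python version in use.

```python
import unicodedata

ALLAH_LIGATURE = "\uFDF2"                 # ARABIC LIGATURE ALLAH ISOLATED FORM
DECOMPOSED = "\u0627\u0644\u0644\u0647"   # alef, lam, lam, heh

# Raw decomposition mapping from the Unicode Character Database.
# The leading <tag> marks it as a compatibility (not canonical) mapping.
mapping = unicodedata.decomposition(ALLAH_LIGATURE)

# Compatibility normalisation forms replace FDF2 with the four letters.
nfkd = unicodedata.normalize("NFKD", ALLAH_LIGATURE)
nfkc = unicodedata.normalize("NFKC", ALLAH_LIGATURE)

# Canonical normalisation leaves FDF2 untouched.
nfc = unicodedata.normalize("NFC", ALLAH_LIGATURE)

# The four letters are never composed back into FDF2.
recomposed = unicodedata.normalize("NFKC", DECOMPOSED)

# Neither shadda (0651) nor superscript alef (0670) appears in the result.
has_marks = any(mark in nfkd for mark in ("\u0651", "\u0670"))

print(mapping)
print([f"{ord(ch):04X}" for ch in nfkd], has_marks)
```

All of this happens at the character level, before any font is involved, which is why it places no constraint on what a font's glyph processing does.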

In terms of handling formation of this theograph in glyph processing, my recommendation is to include two versions of the ligature (if using a ligature substitution; it can also be formed with contextual letter shapes): one with marks and one without, and give them appropriate input sequences.
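To make the idea of two ligatures with separate input sequences concrete, here is a toy simulation in Python. It is not OpenType code and not anyone's actual font logic: the glyph names `allah.marks` and `allah.plain` are hypothetical, the mark order (shadda, then superscript alef) is an assumption about the input, and joining context is ignored entirely.

```python
# Hypothetical glyph names keyed by the character sequence that calls them.
LIGATURES = {
    "\u0627\u0644\u0644\u0651\u0670\u0647": "allah.marks",  # with 0651 and 0670
    "\u0627\u0644\u0644\u0647": "allah.plain",              # bare letters only
}


def shape(text):
    """Return a list of glyph names, trying the longest input sequence first."""
    rules = sorted(LIGATURES.items(), key=lambda kv: len(kv[0]), reverse=True)
    glyphs = []
    i = 0
    while i < len(text):
        for sequence, glyph in rules:
            if text.startswith(sequence, i):
                glyphs.append(glyph)
                i += len(sequence)
                break
        else:
            # No ligature starts here: emit a placeholder name for the character.
            glyphs.append(f"u{ord(text[i]):04X}")
            i += 1
    return glyphs


print(shape("\u0627\u0644\u0644\u0647"))
print(shape("\u0627\u0644\u0644\u0651\u0670\u0647"))
```

With this arrangement the displayed marks always correspond to marks that are actually present in the text.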

Khaled Hosny · December 2018

In addition to what John said, my experience with traditional-looking fonts is that Arabic users expect the sequence لله to give the name of God ligature (complete with the shadda and small alef), but there are a few Arabic and non-Arabic words that have the same sequence but do not mean the name of God (it might even depend on vocalization; فالله can be فاللَّه or فالَلَه, the former needs the ligature and the latter does not). I got an idea from someone on a website that is long gone to implement a black list in the font for sequences that must not have the ligature. My implementation is here. I haven’t gotten any bug reports about it in a long time, so I think it is good enough.
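The general shape of such a black list can be sketched in a few lines of Python. This is a toy model under my own assumptions, not the implementation referred to above: it works on whole unvocalised words rather than glyph runs, the name `wants_ligature` is made up, and the single entry in `EXCEPTIONS` is a placeholder chosen only for illustration, not a vetted exception.

```python
LAM_LAM_HEH = "\u0644\u0644\u0647"  # the sequence لله

# Placeholder entry for illustration only; a real list would need
# to be compiled and checked by someone who knows the language.
EXCEPTIONS = {"\u0638\u0644\u0644\u0647"}


def wants_ligature(word, exceptions=EXCEPTIONS):
    """True if the word contains lam-lam-heh and is not black-listed."""
    return LAM_LAM_HEH in word and word not in exceptions


print(wants_ligature("\u0627\u0644\u0644\u0647"))   # alef lam lam heh
print(wants_ligature("\u0638\u0644\u0644\u0647"))   # black-listed placeholder
```

In a font the same idea would presumably be expressed as contextual exceptions ahead of the ligature rule; the sketch only shows the decision, not how it is encoded.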

For simplified or modernish designs, I do away with the ligature altogether and I didn’t get any complaints.

Denis Moyogo Jacquerye · December 2018

See the related https://github.com/w3c/alreq/issues/125:

In Unicode 1.1, U+FDF2 had the following glyph:

Image: https://us.v-cdn.net/5019405/uploads/editor/qd/i2camna81w7x.png

It had neither a shadda nor a superscript alef.

Some fonts have FDF2 like that (no shadda, no superscript alef): Adelle Sans Arabic, Palatino Arabic, Palatino Sans Arabic, Zapfino Arabic, Neue Helvetica Arabic, Frutiger Arabic, Univers Next Arabic, Hasan Hiba and probably others.

Some fonts have FDF2 with a shadda but no superscript alef: Hasan Alquds Unicode and maybe others.

Some fonts have FDF2 with a shadda and a fatha instead of a superscript alef: Harmattan and probably others. This seems to be rather common in manuscripts and books actually.

Most Windows system fonts with Arabic form the ligature with shadda and superscript alef automatically.

So any font may do any of these, and users shouldn’t assume that typing (alef) lam lam heh (or heh goal) without shadda and superscript alef will produce the same thing everywhere.

I think the best way to handle this is to give the user exactly what they input. The ligature with additional marks could be accessible via stylistic set if input is an issue and OT features aren’t.

Titus Nemeth · December 2018

Hi all, and thank you for your comments.

I agree that the input should be reflected most directly, and am glad about the confirmation that Unicode normalisation should not be seen as prescribing glyph processing.

I have been including the two versions, with and without vowels, in all my fonts thus far, but ran into a glyph-palette insertion problem, which made me question the basics. In the end it turned out to be related to a different issue.

Regarding the black-list approach, I am not sure if this is necessary. Making it a user-choice (for example through dlig) seems more appropriate to me.

Thanks!
