Attaching a Devanagari Vedic Extensions Base glyph with a Mark

Ashwini Aggarwal · December 2020

The Unicode block Vedic Extensions has four base glyphs, the gomukhas, namely Unicode $1CE9, $1CEA, $1CEB, $1CEC.
These do not get attached to the marks Anusvara $0902 or Candrabindu $0901.

We are unable to attach them to any mark whatsoever, yet in Vedic texts these Gomukhas appear only with an Anusvara or Candrabindu. How can we achieve this?

John Hudson · December 2020

In what software are you testing this behaviour, and what does the broken result look like?

This is how the combinations appear in InDesign with the HarfBuzz layout engine active:

Image: https://us.v-cdn.net/5019405/uploads/editor/yc/0atyqjilo26o.png

Kamal Mansour · December 2020

Make sure the gomukhas (u1CE9–u1CEC) are all defined as base characters, and are also fitted with an anchor point compatible with the Anusvara and Candrabindu.
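
At the character level, the Unicode data already agrees with this advice. The following is a minimal Python sketch using only the standard `unicodedata` module. It inspects the Unicode database shipped with the interpreter, not the font itself, so the font's GDEF glyph classes and GPOS anchors still need to be checked separately in a font tool. The helper name `describe` is made up for this illustration.

```python
import unicodedata

GOMUKHAS = [0x1CE9, 0x1CEA, 0x1CEB, 0x1CEC]
MARKS = [0x0901, 0x0902]  # Candrabindu, Anusvara


def describe(cp):
    """Return (code point label, general category, character name)."""
    ch = chr(cp)
    return (f"U+{cp:04X}", unicodedata.category(ch), unicodedata.name(ch, "<unnamed>"))


for cp in GOMUKHAS + MARKS:
    print(*describe(cp))
```

A general category of `Lo` marks a letter (a base character), while `Mn` marks a nonspacing combining mark.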

Ashwini Aggarwal · December 2020

We are using Microsoft Word 2016 on Windows 7, and Word 2019 on Windows 10.
The Result with font Sanskrit2020
https://sourceforge.net/projects/advaita-sharada-font/files/Devanagari/

Image: https://us.v-cdn.net/5019405/uploads/editor/dj/6bgh8vt155bk.png

and with font SanskritText

Image: https://us.v-cdn.net/5019405/uploads/editor/01/cobr2c6qywzp.png

Yes, the Gomukhas are all Base glyphs and have an Anchor compatible with Anusvara and Candrabindu.

Simon Cozens · December 2020

Your font works correctly in Harfbuzz, but not in CoreText. I suspect there are differences between the syllable clustering algorithms in Harfbuzz/CoreText/Uniscribe. Harfbuzz considers this to be a "symbol cluster". I suspect the other engines don't have that rule.

John Hudson · December 2020

Or, the Windows DWrite and Mac CoreText Indic layout engines have not been updated to handle the Vedic characters at all.

Ashwini Aggarwal · December 2020

Thanks John for your observations. It certainly looks like the Windows shaping engine has not been updated to include the Vedic Extensions and Devanagari Extended Unicode blocks.
1. When we do an Insert Symbol, and choose our font or the Sanskrit Text font, the font subset popup dialog shows "Buginese" and "Phags-pa" respectively.
2. Also the Unicode Name for these glyphs shows a blank.

Image: https://us.v-cdn.net/5019405/uploads/editor/bi/rsfxhdzrut8f.png

Ashwini Aggarwal · December 2020

Thanks very much Simon for pointing out that our font works in Harfbuzz. However, Harfbuzz is not preferred for Vedic Sanskrit for a simple reason: it does not prevent INVALID combinations, which Uniscribe does admirably.

John Hudson · December 2020

While I understand how it can be helpful to indicate character sequences that break standardised norms, the proper level for that kind of functionality is a spellchecker, not a shaping engine. Such behaviour can be especially problematic when dealing with writing systems and languages with a long history, much of it pre-standardisation. In the early 2000s we had to persuade Microsoft to relax the valid sequence rules used in their Hebrew shaping engine because there were combinations occurring in Biblical text that don't occur in modern Hebrew but are needed by scholars working with variant manuscripts, etc.

Ashwini Aggarwal · December 2020

In my opinion a spell checker is impractical and wouldn't do the job.
On the other hand, a shaping engine can very well
a) Prevent two diacritics of the same class from attaching to the Base glyphs.
e.g. Prevent both Anusvara and Candrabindu simultaneously from attaching to a Base glyph.
b) Prevent INVALID combining marks as given in Unicode specifications for Sanskrit.

Simon Cozens · December 2020

Can you give me an example of an invalid string which Uniscribe and Harfbuzz treat differently?

John Hudson · December 2020

HarfBuzz is quite permissive, in that it doesn't insert a dotted circle when a sequence might be considered invalid. The MS engine would insert dotted circle in these sequences, for example, because the first includes two vowel signs in sequence and the second includes both candrabindu and anusvara on the same base.

कीु कंँ

I think the future of orthography validation has to be at a higher level, such as a spellchecker or dedicated validation tool, because newly encoded Indic scripts are passed to the Universal Shaping Engine, which necessarily has a much more permissive model because it needs to handle scripts in which e.g. candrabindu and anusvara can occur on the same base (Chakma). At some point, it is likely that even scripts like Devanagari will be passed to USE (using a dev3 script tag, as Mac CoreText already supports semi-officially).
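
For reference, the two example sequences above can be spelled out by code point. This is a small Python sketch using the standard `unicodedata` module. It assumes the strings are exactly the sequences written with escapes below, which matches the description of two vowel signs in one case and anusvara plus candrabindu in the other. The function name `code_points` is only illustrative.

```python
import unicodedata


def code_points(text):
    """List each character as 'U+XXXX NAME'."""
    return [f"U+{ord(ch):04X} {unicodedata.name(ch)}" for ch in text]


TWO_VOWEL_SIGNS = "\u0915\u0940\u0941"       # ka + vowel sign ii + vowel sign u
CANDRABINDU_ANUSVARA = "\u0915\u0902\u0901"  # ka + anusvara + candrabindu

for sample in (TWO_VOWEL_SIGNS, CANDRABINDU_ANUSVARA):
    print(sample, code_points(sample))
```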

Ashwini Aggarwal · December 2020

The Harfbuzz example shows invalid strings being accepted: when we type क ी ू it joins them all. When we type क ं ँ it again joins them all. क ृ ी े is joined too, as is क ँ ँ.

Image: https://us.v-cdn.net/5019405/uploads/editor/p7/8f3fj51b9xj6.png

Uniscribe example - it simply does not allow joining of two diacritics of the same class in Sanskrit. The second diacritic is never typed. Only the first diacritic gets attached.

Image: https://us.v-cdn.net/5019405/uploads/editor/3q/3612nyp7qzoc.png

John Hudson · December 2020

I think there are a couple of different things at work here: one would be the shaping engine, and another would be the input method. As far as I know, Uniscribe (actually DWrite in most environments now) does not prevent typing of invalid sequences, but only indicates what it considers invalid by insertion of the dotted circle. This is the result that I get typing in Windows 10 Notepad (which provides most direct access to shaping engines) using the standard Windows Hindi Traditional keyboard:

Image: https://us.v-cdn.net/5019405/uploads/editor/46/fs2pplscc8si.png

It may be the case, however, that a different keyboard input may itself prevent typing of what it defines as invalid sequences. I think, for instance, it is possible to create Keyman keyboards that behave that way.

Simon Cozens · December 2020

I've been digging around in Harfbuzz's syllable-forming and invalid cluster rules for other reasons, and now I understand a bit more about this problem. @Ashwini Aggarwal, what you're describing is not shaper behaviour. It is, as John mentions, the keyboard input method environment which is replacing the first matra with the second matra.

If you think about it for a while, it becomes obvious: If the shaper is being fed the sequence क ी ू and one of the matras just disappears, that's a bug. The shaper shouldn't ignore bits of text in the input. It should either render it as is, or add a dotted circle to mark an invalid cluster.

Harfbuzz (and CoreText) renders a cluster with multiple matras, which is correct because the shaping engine documentation for Devanagari explicitly allows for consonant clusters to have multiple matras (probably to handle split matra cases):

Consonant syllable:
{C+[N]+<H+[<ZWNJ|ZWJ>]|<ZWNJ|ZWJ>+H>} + C+[N]+[A] + [< H+[<ZWNJ|ZWJ>] | {M}+[N]+[H]>]+[SM]+[(VD)]
Where:
...
M matra (up to one of each type: pre-, above-, below- or post- base)
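
One way to experiment with the quoted grammar is to translate it into a regular expression over single-letter class codes. The Python sketch below rests on several assumptions: `{ }` is read as zero or more, `[ ]` as optional, `< | >` as alternation and `( )` as one or more, and the class letters in the comment (in particular `N` for nukta and `A` for anudatta) are a reading of the abbreviations, not a quotation of the elided legend. It is not how any shaping engine is implemented.

```python
import re

# Class codes used here (one letter each, chosen for this sketch only):
#   C consonant, N nukta, H halant, J ZWJ or ZWNJ, A anudatta,
#   M matra, S syllable modifier, V Vedic sign
CONSONANT_SYLLABLE = re.compile(
    r"(?:CN?(?:HJ?|JH))*"   # {C+[N]+<H+[<ZWNJ|ZWJ>]|<ZWNJ|ZWJ>+H>}
    r"CN?A?"                # C+[N]+[A]
    r"(?:HJ?|M*N?H?)?"      # [<H+[<ZWNJ|ZWJ>]|{M}+[N]+[H]>]
    r"S?V*"                 # [SM]+[(VD)]
)


def is_consonant_syllable(classes):
    """True if the string of class codes matches the grammar as read above."""
    return CONSONANT_SYLLABLE.fullmatch(classes) is not None


print(is_consonant_syllable("CMM"))   # consonant followed by two matras
print(is_consonant_syllable("CHCM"))  # conjunct followed by one matra
```

Read this way, the grammar itself places no limit on the number or order of matras. Any such limit has to come from the prose note attached to `M`.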

I imagine that if you put the text कीू into a file (without the keyboard IME getting in the way) and opened it, Uniscribe would also render them in the same way that Harfbuzz does.

My bet is that what's happening in John's case is that the Hindi Traditional IME is adding the dotted circle for him, and your IME is replacing one matra with another.
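
To take the keyboard out of the picture as suggested above, the test string can be written to a file from code points. This is a minimal Python sketch. The function name `write_sample` and any file name passed to it are hypothetical, and the sequence is assumed to be ka followed by vowel sign ii and vowel sign uu.

```python
from pathlib import Path

# ka + vowel sign ii + vowel sign uu, written with escapes so that no
# keyboard or input method can alter the sequence.
SAMPLE = "\u0915\u0940\u0942"


def write_sample(path):
    """Write SAMPLE to path as UTF-8 and return the bytes written."""
    data = SAMPLE.encode("utf-8")
    Path(path).write_bytes(data)
    return data
```

Calling `write_sample("matra-test.txt")` and opening the resulting file in the application under test shows what the shaping engine does with the unmodified sequence.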

Denis Moyogo Jacquerye · December 2020

Simon Cozens said:

My bet is that what's happening in John's case is that the Hindi Traditional IME is adding the dotted circle for him, and your IME is replacing one matra with another.

This is what Microsoft Edge shows. It’s not the IME that inserts the dotted circle.

Image: https://us.v-cdn.net/5019405/uploads/editor/lh/mnexgovebexv.png

Simon Cozens · December 2020

This would not be the first time that Uniscribe does not follow its own documentation.

Peter Constable · December 2020

Simon Cozens said:

This would not be the first time that Uniscribe does not follow its own documentation.

Well, in fact, this is not such a situation: as indicated, it allows "up to one of each type: pre-, above-, below- or post- base" in that order. If you reverse the order of the vowels on the consonant in the example given, then it displays both without any dotted circles:

कुी

Would it make sense to any speaker of any language using Devanagari script to place u-kaar below ii-kaar? (Not meant as a rhetorical question.)
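
The ordering rule described above can be sketched in a few lines of Python. The position table covers only a handful of matras and is an assumption made for this illustration, not data taken from any shaping engine. The function name `matras_in_order` is likewise made up.

```python
# Positioning classes for a few Devanagari matras. This table is an
# assumption made for the sketch, not a copy of any engine's data.
PRE, ABOVE, BELOW, POST = range(4)
MATRA_POSITION = {
    "\u093F": PRE,    # vowel sign i
    "\u0947": ABOVE,  # vowel sign e
    "\u0948": ABOVE,  # vowel sign ai
    "\u0941": BELOW,  # vowel sign u
    "\u0942": BELOW,  # vowel sign uu
    "\u0943": BELOW,  # vowel sign vocalic r
    "\u093E": POST,   # vowel sign aa
    "\u0940": POST,   # vowel sign ii
}


def matras_in_order(cluster):
    """True if matras occur at most once per class, in pre/above/below/post order."""
    positions = [MATRA_POSITION[ch] for ch in cluster if ch in MATRA_POSITION]
    # Strictly increasing positions means no class repeats and none is out of order.
    return all(a < b for a, b in zip(positions, positions[1:]))


print(matras_in_order("\u0915\u0941\u0940"))  # ka + u + ii
print(matras_in_order("\u0915\u0940\u0941"))  # ka + ii + u
```

Under this reading, u followed by ii (below-base then post-base) passes, while ii followed by u does not.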

Simon Cozens · December 2020

Peter Constable said:

Well, in fact, this is not such a situation: as indicated, it allows "up to one of each type: pre-, above-, below- or post- base" in that order.

Where was the fact that they need to be in order indicated?

Peter Constable · December 2020

Granted, that's not made as clear as it could be: the positions are listed in that order, but it doesn't clearly state that the order matters. Nonetheless, the implementation is not in disagreement with the documentation.

Ashwini Aggarwal · January 2021

Simon Cozens said:
Harfbuzz (and CoreText) renders a cluster with multiple matras, which is correct because the shaping engine documentation for Devanagari explicitly allows for consonant clusters to have multiple matras (probably to handle split matra cases):

M	matra (up to one of each type: pre-, above-, below- or post- base)

This simply means only one Matra (and not pre+above+...).
See below as you read further in https://docs.microsoft.com/en-us/typography/script-development/devanagari#analyze-the-text

"Indic clusters are subject to the following constraints:

Only one reph is allowed per syllable.
Only one pre-base reordering Ra is allowed per syllable.
A nukta can be placed on a consonant, matra or independent vowel. It cannot be placed on a pre-composed nukta character.
One matra from each positioning class is permitted (exception in the Kannada script). A composite matra is treated as belonging to all the classes from which its components belong.
One syllable modifier sign is allowed per cluster.
Vedic signs are combining marks (used for Sanskrit) that should be included in all Indic scripts.
Danda and Double Danda are punctuation marks that should be included in all Indic scripts."

John Hudson · January 2021

One matra from each positioning class is permitted (exception in the Kannada script). A composite matra is treated as belonging to all the classes from which its components belong.

One from each means that a cluster can contain more than one matra so long as each is from a different positioning class.

Ashwini Aggarwal · January 2021

Thanks so much everyone for your invaluable comments and discussion.
Wish you all a Productive, Creative and Successful New Year.

Just listing my understanding regarding Indic Shaping Requirements.

The Devanagari Unicode Block is 0900 to 097F = 128 glyphs.

From a purely Sanskrit point-of-view (where glyphs from Awadhi, Dravidian, etc. are not accounted for).

The characters herein are classified as:

Base Glyph Vowel = 0905 to 090C, 090F, 0910, 0913, 0914, 0960, 0961 = 14 glyphs = V
Base Glyph Consonant = 0915 to 0928, 092A to 0930, 0932, 0933, 0935 to 0939, 097A = 35 glyphs = C
Mark Glyph Matra Vowel Sign = 093E to 0944, 0947, 0948, 094B, 094C, 0962, 0963 = 13 glyphs = M
Mark Glyph Nasalization Aspirate Sign = 0900, 0901, 0902, 0903 = 4 glyphs = N
Mark Glyph Accent Stress Sign = 0951, 0952 = 2 glyphs = A

-----------------------------------------------------------------------

Numerals = 0966 to 096F = 10 glyphs = D
Punctuation Sign = 0964, 0965 = 2 glyphs = P
Sandhi Sign = 093D = 1 glyph = S
Halant Virama Sign = 094D = 1 glyph = H

-----------------------------------------------------------------------

For the purposes of this discussion we have given arbitrary abbreviations:

- Base Glyph Vowel = V

- Base Glyph Consonant = C

- Mark Glyph Matra Vowel Sign = M

- Mark Glyph Nasalization Sign = N

- Mark Glyph Accent Stress Sign = A
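
The counts in the classification above can be checked mechanically. This is a small Python sketch that expands the listed ranges into sets. The helper `expand` and the dictionary `CLASSES` are names invented for the illustration, and the ranges are copied from the list, not from any shaping engine.

```python
def expand(*parts):
    """Expand single code points and (first, last) ranges into a set."""
    result = set()
    for part in parts:
        if isinstance(part, tuple):
            first, last = part
            result.update(range(first, last + 1))
        else:
            result.add(part)
    return result


CLASSES = {
    "V": expand((0x0905, 0x090C), 0x090F, 0x0910, 0x0913, 0x0914, 0x0960, 0x0961),
    "C": expand((0x0915, 0x0928), (0x092A, 0x0930), 0x0932, 0x0933,
                (0x0935, 0x0939), 0x097A),
    "M": expand((0x093E, 0x0944), 0x0947, 0x0948, 0x094B, 0x094C, 0x0962, 0x0963),
    "N": expand((0x0900, 0x0903)),
    "A": expand(0x0951, 0x0952),
}

for name, members in CLASSES.items():
    print(name, len(members))
```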

Indic Shaping properties must account for the following to create VALID text:

1. Base Glyph Vowel = V cannot attach to Mark Glyph Matra Vowel Sign = M.
“VM” is an invalid combination. Indic shaping must prevent this. E.g. 0905+0940 is invalid.
2. Base Glyph Vowel = V can attach to only one glyph from the set Mark Glyph Nasalization Sign = N.
“VNN” is an invalid combination. Here the last typed glyph should erase any previous N.
E.g. 0905+0901+0902 is invalid. 0905+0901+0901 is invalid.
3. Base Glyph Vowel = V can attach to only one glyph from the set Mark Glyph Accent Stress Sign = A.
“VAA” is an invalid combination. Here the last typed glyph should erase any previous A.
E.g. 0905+0951+0952 is invalid. 0905+0951+0951 is invalid.
4. Base Glyph Vowel = V can attach to only two glyphs from glyph sets (N, A) with the provision that the two glyphs are from distinct glyph sets.
“VNA”, “VAN” are valid combinations. E.g. 0905+0901+0952 is valid.
“VNAN”, “VANA” are invalid. Here the last typed glyph should erase any previous N, A if duplicate.
5. Ligatures consisting of Punctuation Sign, Sandhi Sign must be prevented in Indic shaping.
6. Ligatures consisting only of Numerals 1 or 3 along with both the Accents are permitted.
0967 + 0951 + 0952 (all three together) is valid.
0969 + 0951 + 0952 (all three together) is valid.
0967 + 0951 is invalid.
0967 + 0952 is invalid.
0969 + 0951 is invalid.
0969 + 0952 is invalid.
7. Base Glyph Consonant = C can attach to only one Mark Glyph Matra Vowel Sign = M.
“CM” is a valid combination.
“CMM” is invalid. Indic shaping must prevent this by keeping only the last typed M.
8. Base Glyph Consonant = C can attach to only three glyphs from glyph sets (M, N, A) with the provision that the three glyphs are from distinct glyph sets, and CM cluster precedes the others.
“CMNA”, “CMAN”, “CNA”, “CAN” are valid combinations.
“CMNAN”, “CMANA” are invalid. Here the last typed glyph should erase any previous N, A if duplicate.
9. With the use of Halant Virama Sign H, indic shaping must select the appropriate consonant ligatures if present in the font.
10. For invalid combinations, the dotted circle glyph 25CC must show if implemented in font.
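
The V and C cluster rules in this list (the VM rule and rules 2 to 4, 7 and 8) can be expressed as a standalone check over strings of class letters. The Python sketch below is only that: it encodes the rules as proposed in this post, which are stricter than the shaping documentation quoted earlier in the thread, and it says nothing about where such a check should run. The names `VALID_CLUSTER` and `is_valid_cluster` are hypothetical.

```python
import re

# A base is either a vowel (V) or a consonant with at most one matra (CM?).
# It may be followed by at most one N and at most one A, in either order.
VALID_CLUSTER = re.compile(r"(?:V|CM?)(?:NA?|AN?)?")


def is_valid_cluster(classes):
    """True if the string of class letters satisfies the proposed rules."""
    return VALID_CLUSTER.fullmatch(classes) is not None


for sample in ("VNA", "VAN", "CMNA", "CAN", "VM", "VNN", "CMM", "CMANA"):
    print(sample, is_valid_cluster(sample))
```

The numeral, punctuation, halant and dotted circle rules (5, 6, 9 and 10) are not covered by this sketch.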

Attached a sample pdf file to illustrate the above points.

Ashwini Aggarwal · January 2021

Attached a sample image to illustrate the above points.

Image: https://us.v-cdn.net/5019405/uploads/editor/bg/tfelxua0jpm7.jpg

John Hudson · January 2021

Here the last typed glyph should erase any previous N, A if duplicate.

Ligatures consisting of Punctuation Sign, Sandhi Sign must be prevented in Indic shaping.

These are still assumptions about how/where invalid sequences should be handled. No one is denying that some sequences of characters are definitely invalid for specific languages or generally in some scripts. What is debated is whether shaping engines should be the correct level at which to flag, prevent, or override invalid sequences.

The majority of Indic scripts now encoded in Unicode are passed to the Universal Shaping Engine for shaping, which necessarily has a far more permissive cluster model than the older Indic shaping engines still used for most Devanagari processing. That is the way of the future for Indic shaping—possibly for Devanagari too, if the dev3 script tag already supported by Apple is officially defined.
