The Unicode block Vedic Extensions has four base glyphs, the gomukhas, namely U+1CE9, U+1CEA, U+1CEB and U+1CEC.
These do not attach to the marks Anusvara (U+0902) or Candrabindu (U+0901).
In fact we are unable to attach any mark to them at all, yet in Vedic texts these Gomukhas are seen only with Anusvara or Candrabindu. How may we achieve the same?
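As a quick check of what the Unicode Character Database says about the code points in the question, here is a minimal sketch using Python's standard `unicodedata` module. The helper name `describe` is made up for illustration, and the result depends on the Unicode version bundled with your Python build.

```python
import unicodedata

GOMUKHAS = [0x1CE9, 0x1CEA, 0x1CEB, 0x1CEC]   # Vedic Extensions
MARKS = [0x0901, 0x0902]                      # Candrabindu, Anusvara

def describe(cp):
    """Return (code point label, general category, character name)."""
    ch = chr(cp)
    return (f"U+{cp:04X}",
            unicodedata.category(ch),
            unicodedata.name(ch, "<unnamed>"))

for cp in GOMUKHAS + MARKS:
    print(*describe(cp))
```

The general category shows whether the database treats a character as a letter (base) or as a non-spacing mark; it says nothing about whether a given font or shaping engine will actually attach one to the other.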
This is how the combinations appear in InDesign with the HarfBuzz layout engine active:
The Result with font Sanskrit2020
https://sourceforge.net/projects/advaita-sharada-font/files/Devanagari/
and with font SanskritText
Yes, the Gomukhas are all Base glyphs and have an Anchor compatible with Anusvara and Candrabindu.
1. When we do an Insert Symbol, and choose our font or the Sanskrit Text font, the font subset popup dialog shows "Buginese" and "Phags-pa" respectively.
2. Also the Unicode Name for these glyphs shows a blank.
Since it does not prevent INVALID combinations which Uniscribe does admirably.
On the other hand, a shaping engine can very well
a) Prevent two diacritics of the same class from attaching to the Base glyphs.
e.g. Prevent both Anusvara and Candrabindu simultaneously from attaching to a Base glyph.
b) Prevent INVALID combining marks as given in Unicode specifications for Sanskrit.
HarfBuzz is quite permissive, in that it doesn't insert a dotted circle when a sequence might be considered invalid. The MS engine would insert dotted circle in these sequences, for example, because the first includes two vowel signs in sequence and the second includes both candrabindu and anusvara on the same base.
कीु कंँ
I think the future of orthography validation has to be at a higher level, such as a spellchecker or dedicated validation tool, because newly encoded Indic scripts are passed to the Universal Shaping Engine, which necessarily has a much more permissive model because it needs to handle scripts in which e.g. candrabindu and anusvara can occur on the same base (Chakma). At some point, it is likely that even scripts like Devanagari will be passed to USE (using a dev3 script tag, as Mac CoreText already supports semi-officially).
Uniscribe example - it simply does not allow joining of two diacritics of the same class in Sanskrit. The second diacritic is never typed. Only the first diacritic gets attached.
It may be the case, however, that a different keyboard input may itself prevent typing of what it defines as invalid sequences. I think, for instance, it is possible to create Keyman keyboards that behave that way.
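To make the keyboard-level idea concrete, here is a minimal sketch of an input filter that refuses a second mark of the same class on the current base. It is not Keyman code and not how Uniscribe or any IME is implemented; the class sets and the function `type_key` are assumptions chosen for illustration.

```python
# Hypothetical mark classes, for illustration only.
NASAL = {"\u0901", "\u0902"}    # candrabindu, anusvara
ACCENT = {"\u0951", "\u0952"}   # the two Devanagari stress signs
MARK_CLASSES = [NASAL, ACCENT]

def is_mark(ch):
    return any(ch in cls for cls in MARK_CLASSES)

def type_key(buffer, ch):
    """Return the text buffer after a keystroke.

    A mark is ignored if the trailing run of marks already
    contains a mark of the same class.
    """
    for cls in MARK_CLASSES:
        if ch in cls:
            i = len(buffer)
            # Walk back over the marks already typed on the current base.
            while i > 0 and is_mark(buffer[i - 1]):
                if buffer[i - 1] in cls:
                    return buffer      # duplicate class: drop the keystroke
                i -= 1
    return buffer + ch
```

With this filter, typing anusvara and then candrabindu after क leaves only the anusvara in the buffer, so the shaping engine never sees the doubled sequence. The sketch only looks at the trailing run of nasal and accent marks, so an intervening matra would end the check.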
If you think about it for a while, it becomes obvious: If the shaper is being fed the sequence क ी ू and one of the matras just disappears, that's a bug. The shaper shouldn't ignore bits of text in the input. It should either render it as is, or add a dotted circle to mark an invalid cluster.
HarfBuzz (and CoreText) render a cluster with multiple matras, which is correct because the shaping engine documentation for Devanagari explicitly allows for consonant clusters to have multiple matras (probably to handle split matra cases):
{C+[N]+<H+[<ZWNJ|ZWJ>]|<ZWNJ|ZWJ>+H>} + C+[N]+[A] + [< H+[<ZWNJ|ZWJ>] | {M}+[N]+[H]>]+[SM]+[(VD)]
Where:
...
M matra (up to one of each type: pre-, above-, below- or post- base)
My bet is that what's happening in John's case is that the Hindi Traditional IME is adding the dotted circle for him, and your IME is replacing one matra with another.
कुी
Would it make sense to any speaker of any language using Devanagari script to place u-kaar below ii-kaar? (Not meant as a rhetorical question.)
"HarfBuzz (and CoreText) renders a cluster with multiple matras, which is correct because the shaping engine documentation for Devanagari explicitly allows for consonant clusters to have multiple matras (probably to handle split matra cases):"
This simply means only one Matra. (and not pre+above+...)
See below as you read further in https://docs.microsoft.com/en-us/typography/script-development/devanagari#analyze-the-text
"Indic clusters are subject to the following constraints:
One from each means that a cluster can contain more than one matra so long as each is from a different positioning class.
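The "up to one of each type" reading can be sketched as a small check. The position table below is a partial, hand-written assumption covering only the common Devanagari vowel signs, and `matras_ok` is a hypothetical helper, not part of any shaping engine.

```python
from collections import Counter

# Assumed positioning classes for common Devanagari matras (partial table).
_POSITIONS = {
    "pre":   [0x093F],
    "above": list(range(0x0945, 0x0949)),
    "below": list(range(0x0941, 0x0945)) + [0x0962, 0x0963],
    "post":  [0x093E, 0x0940] + list(range(0x0949, 0x094D)),
}
MATRA_POSITION = {chr(cp): pos
                  for pos, cps in _POSITIONS.items()
                  for cp in cps}

def matras_ok(cluster):
    """True if the cluster has at most one matra per positioning class."""
    counts = Counter(MATRA_POSITION[ch]
                     for ch in cluster if ch in MATRA_POSITION)
    return all(n <= 1 for n in counts.values())
```

Under this reading, ka + ii + u (U+0915 U+0940 U+0941) passes because it has one post-base and one below-base matra, while ka + u + uu (U+0915 U+0941 U+0942) fails because both matras are below-base. Whether a given engine flags either sequence is a separate question from what this check reports.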
Wish you all a Productive, Creative and Successful New Year.
Just listing my understanding regarding Indic Shaping Requirements.
The Devanagari Unicode Block is 0900 to 097F = 128 code points.
From a purely Sanskrit point-of-view (where glyphs from Awadhi, Dravidian, etc. are not accounted for).
The characters herein are classified as:
[Classification table not preserved.]
For the purposes of this discussion we have given arbitrary abbreviations:
- Base Glyph Vowel = V
- Base Glyph Consonant = C
- Mark Glyph Matra Vowel Sign = M
- Mark Glyph Nasalization Sign = N
- Mark Glyph Accent Stress Sign = A
Indic Shaping properties must account for the following to create VALID text:
1. Base Glyph Vowel = V cannot attach to Mark Glyph Matra Vowel Sign = M.
“VM” is an invalid combination. Indic shaping must prevent this. E.g. 0905+0940 is invalid.
2. Base Glyph Vowel = V can attach to only one glyph from the Mark Glyph Nasalization Sign = N set.
“VNN” is an invalid combination. Here the last typed glyph should erase any previous N.
e.g. 0905+0901+0902 is invalid. 0905+0901+0901 is invalid.
3. Base Glyph Vowel = V can attach to only one glyph from the Mark Glyph Accent Stress Sign = A set.
“VAA” is an invalid combination. Here the last typed glyph should erase any previous A.
e.g. 0905+0951+0952 is invalid. 0905+0951+0951 is invalid.
4. Base Glyph Vowel = V can attach to only two glyphs from glyph sets (N, A) with the provision that the two glyphs are from distinct glyph sets.
“VNA”, “VAN” are valid combinations. E.g. 0905+0901+0952 is valid.
“VNAN”, “VANA” are invalid. Here the last typed glyph should erase any previous N, A if duplicate.
5. Ligatures consisting of Punctuation Sign, Sandhi Sign must be prevented in Indic shaping.
6. Ligatures consisting only of Numerals 1 or 3 along with both the Accents are permitted.
0967 + 0951 + 0952 (all three together) is valid.
0969 + 0951 + 0952 (all three together) is valid.
0967 + 0951 is invalid.
0967 + 0952 is invalid.
0969 + 0951 is invalid.
0969 + 0952 is invalid.
7. Base Glyph Consonant = C can attach to only one Mark Glyph Matra Vowel Sign = M.
“CM” is a valid combination.
“CMM” is invalid. Indic shaping must prevent this by keeping only the last typed M.
8. Base Glyph Consonant = C can attach to only three glyphs from glyph sets (M, N, A) with the provision that the three glyphs are from distinct glyph sets, and CM cluster precedes the others.
“CMNA”, “CMAN”, “CNA”, “CAN” are valid combinations.
“CMNAN”, “CMANA” are invalid. Here the last typed glyph should erase any previous N, A if duplicate.
9. With the use of Halant Virama Sign H, indic shaping must select the appropriate consonant ligatures if present in the font.
10. For invalid combinations, the dotted circle glyph 25CC must show if implemented in font.
Attached a sample pdf file to illustrate the above points.
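For reference, rules 1 to 4, 7 and 8 above can be written down as a small checker over the V/C/M/N/A abbreviations. This is only a sketch of the rules as proposed in this post, not of what any shaping engine does; the code point ranges in `classify` are my own assumptions, and rule 6 (numerals with accents) and rule 9 (halant ligatures) are left out.

```python
import re

def classify(ch):
    """Map a character to the post's abbreviations (assumed ranges)."""
    cp = ord(ch)
    if 0x0904 <= cp <= 0x0914:
        return "V"                 # independent vowels
    if 0x0915 <= cp <= 0x0939:
        return "C"                 # consonants
    if 0x093E <= cp <= 0x094C:
        return "M"                 # matras (vowel signs)
    if cp in (0x0901, 0x0902):
        return "N"                 # candrabindu, anusvara
    if cp in (0x0951, 0x0952):
        return "A"                 # accent stress signs
    return "?"                     # anything else is outside the sketch

# V or C(+M), then at most one N and one A in either order.
VALID = re.compile(r"(V|CM?)(NA?|AN?)?")

def is_valid_cluster(text):
    classes = "".join(classify(ch) for ch in text)
    return VALID.fullmatch(classes) is not None
```

For example, `is_valid_cluster("\u0905\u0901\u0952")` (VNA) returns True, while 0905+0940 (VM) and 0905+0901+0902 (VNN) return False, matching the examples listed above. A checker like this could sit in a keyboard, a spellchecker or a validation tool as easily as in a shaping engine.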
These are still assumptions about how/where invalid sequences should be handled. No one is denying that some sequences of characters are definitely invalid for specific languages or generally in some scripts. What is debated is whether shaping engines should be the correct level at which to flag, prevent, or override invalid sequences.
The majority of Indic scripts now encoded in Unicode are passed to the Universal Shaping Engine for shaping, which necessarily has a far more permissive cluster model than the older Indic shaping engines still used for most Devanagari processing. That is the way of the future for Indic shaping—possibly for Devanagari too, if the dev3 script tag already supported by Apple is officially defined.