How far away is a future in which precomposed characters built of letters and diacritical marks are no longer included in fonts because they're all composed on the fly with mark-to-base and mark-to-mark positioning? Will we soon hit a point at which users can enter arbitrary mark combinations without having to input the combining mark versions, which are not accessible from most keyboard layouts? Browser developers seem to be moving in this direction, but what about the rest of the software and OS world?
This came up at the OTL implementation meeting last week. Kamal Mansour from Monotype refloated a concept that he and I had first proposed to the OT mailing list a few years ago: a new cmap subtable format that would map from individual Unicodes to multiple glyph indices. This would enable bypassing the need for precomposed diacritic characters.
Unfortunately, the decision at the meeting was that this would involve significant problems for software makers, far outweighing the benefits considering that alternative methods exist to handle decomposition at the glyph processing level and automation of precomposed diacritics at the font tool level. Apart from needing to update software to recognise the new cmap format (during which time the precomposed diacritics would need to be included for backwards compatibility, likely for several years), there are performance implications for calculating buffer sizes when the ratio of glyph IDs to characters becomes essentially arbitrary.
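The buffer-sizing concern can be illustrated with a toy sketch. The one-to-many cmap below is hypothetical (no such subtable format exists), and all glyph IDs are made up:

```python
# Hypothetical one-to-many cmap: each character maps to a *sequence*
# of glyph IDs instead of exactly one (all GIDs made up).
cmap = {"\u00e9": [10, 90],  # base "e" + combining acute
        "a": [4]}

def to_glyphs(text):
    out = []
    for ch in text:
        out.extend(cmap[ch])
    return out

# With today's 1:1 cmap formats, the glyph count equals the character
# count before GSUB runs; here the per-character expansion is
# arbitrary, so an engine cannot preallocate a fixed-ratio buffer.
```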
So even though dynamic GPOS mark positioning is widely supported, and keyboard input for combining marks can be improved, there won't be a method to completely bypass precomposed diacritic characters at the font level.
There are some other options:
a) At the font level, developers could follow Adam Twardoch and Karsten Luecke's webfont approach, and simply use empty glyphs for the precomposed diacritic characters and decompose these to appropriate bases and combining marks in the 'ccmp' feature. This, unlike the cmap approach, doesn't save you GIDs, but unless you're making maxed-out CJK fonts à la Source Han Sans that's not a practical issue. Note, however, that a bug in InDesign fails to apply the 'ccmp' feature to decompose all precomposed diacritics.
b) At the text engine level, software could be changed to apply Unicode normalisation form D (decomposed) before applying OTL, performing a buffered conversion of precomposed characters to canonical decompositions. This is, in effect, the opposite of what many layout engines already do, mapping decomposed sequences to precomposed diacritics if available in the cmap table. Of course, knowing which mechanism to apply depends on knowing how the individual font is made. I'm considering proposing a new flag to be able to indicate that a given font either does not contain precomposed diacritics or prefers decomposed layout to be applied (that enables a single font to both opt for the new mechanism while providing backwards compatibility).
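Both (a) and (b) ultimately lean on Unicode canonical decompositions. A minimal character-level sketch (glyph naming omitted, so this shows only the mapping data, not actual feature code):

```python
import unicodedata

def decomposition_rules(chars):
    """Map each precomposed character to its NFD sequence.
    Characters with no canonical decomposition are skipped."""
    rules = {}
    for ch in chars:
        nfd = unicodedata.normalize("NFD", ch)
        if len(nfd) > 1:
            rules[ch] = list(nfd)
    return rules

# "\u00e9" (e-acute) decomposes to "e" + U+0301 COMBINING ACUTE ACCENT
rules = decomposition_rules(["\u00e9"])
```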
Another approach is to use “ghost” glyph ids instead of empty glyphs, i.e. the cmap will point to glyph ids with no corresponding entries in glyf and similar tables, and ccmp would map them to the actual decomposed glyphs. This should theoretically work, but I couldn’t convince any font generation tool to let me do this so can’t verify if it actually works (IIRC, Graphite uses this concept extensively, but that is a different beast and they control the only implementation).
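A character-level sketch of the ghost-GID idea (all IDs and table contents here are stand-ins, since no tool would build this):

```python
# cmap points "\u00e9" at GID 500, which has no entry in the
# stand-in glyf dict; a ccmp-style mapping then swaps the ghost
# GID for the real base and mark GIDs.
cmap = {"\u00e9": 500, "e": 10, "\u0301": 90}
glyf = {10: "<e outline>", 90: "<acute outline>"}  # nothing at 500
ccmp = {500: [10, 90]}

def map_string(text):
    out = []
    for ch in text:
        gid = cmap[ch]
        out.extend(ccmp.get(gid, [gid]))
    return out
```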
The ghost GID idea is interesting, but I'm not surprised you've had trouble trying to build such a font.
_____
Behdad reports that HarfBuzz is already doing 'magic' normalisation during layout:
So that opens the door to the kind of font that James and others want to make, without precomposed diacritic glyphs. Of course, this only works for characters with Unicode canonical decompositions; additional decompositions will still need to happen in 'ccmp'.
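That per-character fallback can be sketched like this (the `font_glyphs` set is a stand-in for a real cmap lookup, not HarfBuzz's actual API):

```python
import unicodedata

def shape_char(ch, font_glyphs):
    # If the font covers the precomposed character, use it directly.
    if ch in font_glyphs:
        return [ch]
    # Otherwise fall back to the canonical (NFD) decomposition,
    # provided the font covers every piece.
    decomposed = unicodedata.normalize("NFD", ch)
    if len(decomposed) > 1 and all(c in font_glyphs for c in decomposed):
        return list(decomposed)
    return [ch]  # would end up as .notdef
```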
Wouldn’t CALT work for that?
Unicode recognised the impracticality of handling things like overlaid bars and slashes as combining marks, and encodes such diacritics as precomposed characters without canonical decompositions to the existing overlay marks (Unicode's normalisation stability commitments prevent it from encoding any more characters with canonical decompositions).
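The distinction is easy to check against the Unicode data: letters with acute, grave, etc. carry canonical decompositions, while the overlay-style precomposed characters do not:

```python
import unicodedata

# "\u00e9" (e-acute) decomposes canonically to e + combining acute...
print(unicodedata.decomposition("\u00e9"))   # "0065 0301"
# ...but "\u00f8" (o with stroke) has no canonical decomposition, so
# normalisation can never split the overlay off for mark positioning.
print(unicodedata.decomposition("\u00f8"))   # ""
```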
Not sure what you mean by 'connection points' or how this differs from the current situation.
# --- Supports OTF Standard West code page (more or less)
# --- Uses GPOS mark-to-base for characters with diacritics
# --- FEB, 18 February 2016
# --- LANGUAGESYSTEMS
languagesystem DFLT dflt;
languagesystem latn dflt;
lookup MRKCLS_1 {
lookupflag 0;
markClass [gravecomb acutecomb circumflexcomb dieresiscomb tildecomb caroncomb] <anchor 0 0> @DIACRITIC_TOP_1;
pos base [a c e o s u z dotlessi dotlessj] <anchor 0 0> mark @DIACRITIC_TOP_1;
} MRKCLS_1;
lookup MRKCLS_2 {
lookupflag 0;
markClass [cedillacomb ogonekcomb] <anchor 0 0> @DIACRITIC_BELOW;
pos base [a c] <anchor 0 0> mark @DIACRITIC_BELOW;
} MRKCLS_2;
lookup MRKCLS_3 {
lookupflag 0;
markClass [dieresiscomb.i] <anchor 0 0> @DIACRITIC_TOP_2;
pos base [dotlessi] <anchor 0 0> mark @DIACRITIC_TOP_2;
} MRKCLS_3;
feature ccmp {
# --- Glyph Composition/Decomposition
# Replace dotted i/j with their dotless forms before any top mark
sub i' @DIACRITIC_TOP_1 by dotlessi;
sub j' @DIACRITIC_TOP_1 by dotlessj;
# Swap in the narrow dieresis variant over dotless i
sub dotlessi dieresiscomb' by dieresiscomb.i;
} ccmp;
feature mark {
# --- Mark to Base positioning
lookup MRKCLS_1;
lookup MRKCLS_2;
lookup MRKCLS_3;
} mark;
@BASE = [a c e o s u z dotlessi dotlessj];
@MARKS = [@DIACRITIC_TOP_1 @DIACRITIC_TOP_2 @DIACRITIC_BELOW];
table GDEF {
GlyphClassDef @BASE,,@MARKS,;
} GDEF;
Of course, in the case of the f, one can also use a variant with a shortened terminal to prevent collision with the dieresis. Looking at Jenson's 'Eusebius' type, it seems that Jenson also offset diacritics (I didn't find shortened terminals on the f and long s in his roman type).
In Medieval Latin, the early tilde was an abbreviation mark for -am and -um endings, developed to save space on papyrus and parchment. The tilde was later used to mark the long nasalisation of a vowel even inside a word. Its position centred over the vowel was not well established until the late 17th century; before that it was mostly used as Jenson did.
Portuguese vowel combinations like ão appear as ão, aõ and with a tilde over both vowels in books up to 1800.
Some samples:
1578: published in France, with no city indication. Text in French.
1659: published in Lisbon. Text in Portuguese.
1747: published in Rio de Janeiro, the very first book printed in Brazil. Text in Portuguese.
Really?
Sweynheym and Pannartz’s type as used in Opera from 1469 (Museum Meermanno col.):
Da Spira’s type as used in Historia Alexandri Magni from 1473 (Museum Meermanno col.):
Griffo’s type as applied in Hypnerotomachia Poliphili from 1499 (Museum Meermanno col.):
https://github.com/twardoch/ttfdiet/
The Readme has some test results.
BTW, we just released OTM 6 (also available from FL Ltd.)!