Application-side character compositing: a + dieresis → adieresis

yanone Posts: 129
edited October 2013 in Technique and Theory
Hi everyone,
just for the sake of understanding OpenType:

is it at all possible to create a font that contains an a and a combining dieresis and have applications automatically compose the adieresis, even though it isn't present in the font as a precomposed glyph?

I understand that it's possible to have characters composed through mark positioning, but that would require the explicit use of encoded combining marks, so the encoded string would contain the two characters separately. While this would work in cases such as phonetic writing with modifiers, it wouldn't work for normally encoded characters such as ä.

Or would it?
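
To see the distinction concretely, here is a minimal Python sketch using the standard unicodedata module; the ä example is the one from the question, everything else is just illustration:

    import unicodedata

    precomposed = "\u00E4"   # ä encoded as one character (adieresis)
    decomposed = "a\u0308"   # a followed by U+0308 COMBINING DIAERESIS

    # The two strings are canonically equivalent but contain different characters.
    print([hex(ord(c)) for c in unicodedata.normalize("NFD", precomposed)])
    # ['0x61', '0x308'] -- NFD splits ä into base + combining mark
    print([hex(ord(c)) for c in unicodedata.normalize("NFC", decomposed)])
    # ['0xe4']          -- NFC recombines the pair into the precomposed character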

Comments

  • John Hudson Posts: 2,977
    No, this is not currently possible. A couple of years ago I suggested a mechanism, using a new cmap subtable format, that would make it possible, but there was little enthusiasm for the idea at the time. The feedback from Adobe was that any mechanism for composing ä with positioning could just as easily be used to build composites or subroutines for precomposed glyphs in a font, and that doing so would be easier for font makers than implementing support for a new cmap format would be for software.

    [Some layout engines do the reverse operation, though: if text is encoded with combining mark sequences, the layout engine will check the font's cmap table for a matching precomposed mapping.]
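
    As a rough sketch of that reverse lookup (not any particular engine's code; it assumes fontTools is installed, and "MyFont.ttf" and glyphs_for are placeholders):

        import unicodedata
        from fontTools.ttLib import TTFont

        cmap = TTFont("MyFont.ttf").getBestCmap()  # {codepoint: glyph name}

        def glyphs_for(text):
            composed = unicodedata.normalize("NFC", text)      # a + U+0308 -> ä
            if all(ord(c) in cmap for c in composed):
                return [cmap[ord(c)] for c in composed]        # precomposed glyph found
            return [cmap.get(ord(c), ".notdef") for c in text] # fall back to the raw sequence

        # glyphs_for("a\u0308") returns ["adieresis"] if the font maps U+00E4.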
  • [Deleted User] Posts: 0
    edited October 2013
    I hope that Adobe has rethought or will rethink their position. Unicode defines composition but also decomposition.
  • What about web browsers? Do they support decomposition?
  • John Hudson Posts: 2,977
    We should be careful to distinguish what happens in text normalisation (encoding) from what happens in display (layout engines and fonts). Software needs to be able to handle both NFC and NFD text strings, but how those get displayed is sort of incidental to the underlying text encoding. At present, we have layout engines and fonts that:
    a) map precomposed diacritics to precomposed glyphs (straight cmap mapping),
    b) map decomposed diacritics to precomposed glyphs (buffered cmap mapping),
    c) map decomposed diacritics to base+mark glyph sequences (GPOS display),
    d) map decomposed diacritics to precomposed glyphs (GSUB display), and
    e) map precomposed diacritics to base+mark glyph sequences (GSUB + GPOS display).
    Some of these display mechanisms operate at the character level (as encoded in the text or buffered prior to cmap mapping), and some operate at the glyph processing level.

    What we're missing is a (buffered) character level mechanism to map from precomposed diacritics to decomposed base+mark cmap entries, i.e. before one gets to the glyph processing level. This would enable us to make fully decomposed fonts, which would not need to include any precomposed glyphs.

    For characters with Unicode normalisation compositions/decompositions, this could in fact be handled entirely at the layout engine level, with no changes to font formats. In the same way that layout engines currently check the cmap for matching precomposed mappings when they encounter decomposed text sequences, they could check the cmap for decomposed base and mark mappings when they encounter precomposed diacritic characters.
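
    A hypothetical sketch of that fallback, in the same vein as the lookup above (the cmap dict and glyph names are placeholders):

        import unicodedata

        def map_char(char, cmap):
            cp = ord(char)
            if cp in cmap:
                return [cmap[cp]]                     # direct cmap hit
            nfd = unicodedata.normalize("NFD", char)  # e.g. ä -> a + U+0308
            if len(nfd) > 1 and all(ord(c) in cmap for c in nfd):
                # base and mark entries found; GPOS would then position the mark
                return [cmap[ord(c)] for c in nfd]
            return [".notdef"]

        # With cmap = {0x61: "a", 0x308: "dieresiscomb"} and no 0xE4 entry,
        # map_char("\u00E4", cmap) returns ["a", "dieresiscomb"].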

    For me, the more interesting possibilities exist beyond the fixed set of characters with normalised decompositions, and what I proposed to the OpenType list was a new cmap format in which Unicode characters could be mapped to arbitrary glyph sequences. This would enable one to not only handle canonical decompositions but also things like stroke decompositions. [This is, by the way, one of the functionalities of DecoType's Arabic layout model: the mapping from Unicode characters to decomposed strokes without the need to go first to a precomposed glyph entry.]
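
    Purely as an illustration of the idea (none of these glyph names or mappings come from an actual format), such a one-to-many cmap might boil down to a character-to-glyph-sequence table:

        # Hypothetical: one codepoint maps to an arbitrary glyph sequence.
        SEQUENCE_CMAP = {
            0x00E4: ["a", "dieresiscomb"],           # canonical decomposition
            0x0644: ["lam.stroke1", "lam.stroke2"],  # invented stroke decomposition
        }

        def to_glyphs(text):
            glyphs = []
            for ch in text:
                glyphs.extend(SEQUENCE_CMAP.get(ord(ch), [".notdef"]))
            return glyphs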

    But I fully understand why Adobe and others might consider it too late in the day to implement such a mechanism in the context of OpenType, a format with heavy legacy inheritance. The number and diversity of operating systems, layout engines, and applications that would need to support the new mechanism is such that there would be long-term pressure on font makers to avoid making fonts this way or, at least, to make hybrid fonts with precomposed fallbacks, thereby diminishing the whole point of the exercise. And as Adobe pointed out at the time, anything that can be handled as decomposed glyphs can also be handled with composites or subroutines, meaning that precomposed glyphs for precomposed diacritic cmap entries can be generated with scripting and have minimal impact on font size.
  • Many thanks for this thorough explanation.
  • This is precisely how older TeX distributions do things, but suboptimally. It treats accented characters as a sequence, which hampers searches in the resulting documents. Its simple approach also ignores optical balance: sometimes you need to adjust the shapes for the best results. Accents that work with /o might not work as well with /i, for example, and a narrow /a (hello, Bembo!) might deserve a third style. Sometimes you need to place different accents in slightly different positions. The results can disappoint. It shouldn't have to be this way!

    I think the best solution is to have a general set of composing rules, but to allow each font to override them by providing code or precomposed glyphs.
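
    A hypothetical sketch of what such override-able composing rules could look like as data (all names and numbers here are invented for illustration):

        # Generic rule: attach the mark at the base's top-centre anchor.
        DEFAULT_ANCHOR = ("top_center", 0, 0)

        # Per-font overrides for troublesome base/mark pairs.
        FONT_OVERRIDES = {
            ("i", "dieresiscomb"): ("top_center", 0, -20),  # sit lower over the narrow /i
            ("a", "dieresiscomb"): ("top_center", 5, 0),    # nudge for a narrow /a
        }

        def anchor_for(base, mark):
            return FONT_OVERRIDES.get((base, mark), DEFAULT_ANCHOR)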
  • John Hudson Posts: 2,977
    It treats accented characters as a sequence, which hampers searches in the resulting documents.
    If the text is Unicode, that shouldn't be an issue if the search function is doing what it should do. The precomposed and decomposed strings are canonically equivalent, and a good search function should normalise to capture both.
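
    A minimal sketch of such a normalising search, in Python:

        import unicodedata

        def contains(haystack, needle):
            nfc = lambda s: unicodedata.normalize("NFC", s)
            return nfc(needle) in nfc(haystack)

        # Precomposed text, decomposed query -- still a match:
        assert contains("M\u00E4rz", "Ma\u0308rz")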
  • @John Hudson:

    "If the text is Unicode, that shouldn't be an issue if the search function is doing what it should do. The precomposed and decomposed strings are canonically equivalent, and a good search function should normalise to capture both. "

    That would indeed be ideal, but Unicode is still a foreign language to most of the TeX world, alas. Seven bits is not enough.