Automatic Font Repair (Proof of Concept)

The Problem

There exist many historical fonts crafted by amateurs using questionable techniques like:

- misusing code points (e.g. $ for longs)
- precomposed ligature glyphs with code points in the PUA (not following the MUFI standard)
- names not always following best practices
- no feature rules
- overdone, problematic feature rules (e.g. orthography for longs with all irregularities, but it is not possible to render a single longs between spaces)
- not supporting base character + combining mark as the preferred Unicode encoding where Unicode has no precomposed code point (e.g. [AOUaou] + combining e above)
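
The last point can be checked mechanically. The following is a minimal sketch using only Python's standard library; the helper name `has_precomposed` is made up for this illustration. It asks whether Unicode normalisation (NFC) can compose a base letter and a combining mark into a single code point, which it can for the diaeresis but not for combining e above (U+0364).

```python
import unicodedata

COMBINING_DIAERESIS = "\u0308"
COMBINING_E = "\u0364"  # COMBINING LATIN SMALL LETTER E

def has_precomposed(base: str, mark: str) -> bool:
    """Return True if NFC composes base + mark into one code point."""
    return len(unicodedata.normalize("NFC", base + mark)) == 1

# Compare the two marks for the bases named in the list above.
for base in "AOUaou":
    print(base,
          has_precomposed(base, COMBINING_DIAERESIS),
          has_precomposed(base, COMBINING_E))
```

Where the check returns False, the font itself has to provide the combination, either through mark positioning or through a precomposed glyph reached by a substitution rule.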

Concept

- normalise code points (for PUA use MUFI)
- normalise names of characters and variants
- support ZWJ and ZWNJ for explicit control of ligatures, parallel to hlig
- similar for fractions
- remove orthographic rules, or rather, replace all ligature and substitution rules with a basic standardised rule set
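
To make the ZWJ/ZWNJ point concrete, here is a toy simulation in Python. It is not a shaping engine and not the planned rule set: the function `shape`, the ligature table and the `ligatures_on` switch are assumptions for illustration only. It shows the intended behaviour: ZWNJ blocks a ligature, ZWJ requests one even when automatic ligatures are off.

```python
ZWJ, ZWNJ = "\u200D", "\u200C"

# Hypothetical ligature table; glyph names follow the component_component pattern.
LIGATURES = {("c", "h"): "c_h", ("c", "k"): "c_k", ("ſ", "s"): "longs_s"}

def shape(text, ligatures_on=True):
    """Return a list of glyph names for text (toy model, no real shaping)."""
    glyphs = []
    i = 0
    while i < len(text):
        # Explicit request: x ZWJ y forms the ligature even if ligatures are off.
        if (i + 2 < len(text) and text[i + 1] == ZWJ
                and (text[i], text[i + 2]) in LIGATURES):
            glyphs.append(LIGATURES[(text[i], text[i + 2])])
            i += 3
            continue
        # Automatic ligature: only for directly adjacent letters.
        if (ligatures_on and i + 1 < len(text)
                and (text[i], text[i + 1]) in LIGATURES):
            glyphs.append(LIGATURES[(text[i], text[i + 1])])
            i += 2
            continue
        # Joiners themselves are invisible; ZWNJ has already broken adjacency.
        if text[i] not in (ZWJ, ZWNJ):
            glyphs.append(text[i])
        i += 1
    return glyphs
```

In this model `shape("ch")` gives the ligature, `shape("c" + ZWNJ + "h")` gives two separate glyphs, and `shape("c" + ZWJ + "h", ligatures_on=False)` still gives the ligature.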

Purpose

The main purpose is the training of OCR systems. Perfect quality doesn't matter, as the training files are usually created by rendering sample images from hundreds of fonts for a language or period, and the images are also artificially degraded in several variations. Font identification can also be trained to some degree. Reconstructed fonts are always digitised from a specific optical size, e.g. 12 or 16 pt, and these sizes differ in Fraktur: the larger the size, the more swashed the capital letters are. Often the fonts are reconstructed from an already reconstructed cut (late 19th century or a Linotype customisation).

Implementation

My workflow is mostly developed in Perl, which has poor or no support for reading and manipulating OTF files.

But it's easy to work with XML (ttx). It's also easy to render specimens of characters and identify them optically with relatively high accuracy, which covers unidentified glyphs in a font. The remaining ones can be identified manually.
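
As an illustration of how little is needed on the XML side, here is a minimal sketch in Python (standard library only) that rewrites a misused code point in a cmap fragment as it appears in a ttx dump. The fragment, the `REMAP` table and the function name `remap_cmap` are made up for this example. It only changes the code point; renaming the glyph would have to be done consistently in every table of the ttx file.

```python
import xml.etree.ElementTree as ET

# Hand-written fragment in the style of a ttx dump; the font maps "$" to a long s glyph.
TTX_FRAGMENT = """<cmap>
  <tableVersion version="0"/>
  <cmap_format_4 platformID="3" platEncID="1" language="0">
    <map code="0x24" name="dollar"/>
    <map code="0x61" name="a"/>
  </cmap_format_4>
</cmap>"""

# Hypothetical repair table: misused code point -> intended code point.
REMAP = {0x24: 0x17F}

def remap_cmap(xml_text, remap):
    """Return the XML with the code points listed in remap replaced."""
    root = ET.fromstring(xml_text)
    for entry in root.iter("map"):
        code = int(entry.get("code"), 16)
        if code in remap:
            entry.set("code", hex(remap[code]))
    return ET.tostring(root, encoding="unicode")

repaired = remap_cmap(TTX_FRAGMENT, REMAP)
print(repaired)
```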

Planned steps are roughly:

- craft a standard set of rules in fea syntax for historical German

For each font:

- convert to ttx
- normalise code points and glyph names
- render and identify remaining glyphs
- report what's missing for manual solving or accept it
- remove feature rules and apply the standard set
- compile the repaired font
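
The reporting step in the list above could be sketched as follows. This is only an illustration in Python: the `REQUIRED` repertoire is an assumed example, not the actual target set, and `missing_report` is a hypothetical helper that compares the code points found in a font with that repertoire.

```python
# Assumed example repertoire for historical German; the real list would be longer.
REQUIRED = {
    0x00DF: "germandbls",
    0x017F: "longs",
    0x0364: "combining e above",
    0x200C: "ZWNJ",
    0x200D: "ZWJ",
}

def missing_report(present_codepoints, required=REQUIRED):
    """List required code points that the font does not map, in code point order."""
    missing = sorted(set(required) - set(present_codepoints))
    return [f"U+{cp:04X} {required[cp]}" for cp in missing]

# Example: a font that maps only a, germandbls and longs.
report = missing_report({0x0061, 0x00DF, 0x017F})
for line in report:
    print(line)
```

The result can then either go to manual solving or be accepted as is, as described in the step list.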

This doesn't mean that it is an easy development task done in one day. Some problems with the Python font libraries fonttools and fontFeatures by @Simon Cozens can be expected, as they may not support some corner cases of this use case.


Comments

  • Ray Larabie
    For fonts missing combining diacriticals, you could generate them and calculate correct placement based on the offsets where they're placed above or below letters.
  • For fonts missing combining diacriticals, you could generate them and calculate correct placement based on the offsets where they're placed above or below letters.
    That's possible. It depends on how the contours are composed. If the accents are neither connected (as in the cedilla) nor overlapping (as in L with stroke), it's easy. Ideally the repaired font should use anchors and mark-to-mark positioning.
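
A minimal sketch of that calculation in Python, under stated assumptions: the accent is a separate, unscaled component, its bounding box in its own glyph is known, and the composite stores an (x, y) offset for it. The function name and the choice of the bottom centre of the accent as its attachment point are assumptions for illustration.

```python
def anchors_from_composite(mark_bbox, mark_offset):
    """Derive a base anchor and a mark anchor from one composite glyph.

    mark_bbox   -- (x_min, y_min, x_max, y_max) of the accent in its own glyph
    mark_offset -- (dx, dy) by which the composite shifts the accent
    """
    x_min, y_min, x_max, y_max = mark_bbox
    dx, dy = mark_offset
    # Assumed attachment point on the accent: bottom centre of its bounding box.
    mark_anchor = ((x_min + x_max) / 2, y_min)
    # The base anchor is where that point ends up inside the composite.
    base_anchor = (mark_anchor[0] + dx, mark_anchor[1] + dy)
    return base_anchor, mark_anchor

# Example numbers in font units, invented for the illustration.
base_top, mark_top = anchors_from_composite((100, 500, 300, 650), (150, 20))
print(base_top, mark_top)
```

Attaching the mark anchor to the base anchor then reproduces the placement of the original composite.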
  • The Problem

    There exist many historical fonts crafted by amateurs using questionable techniques like:

    - overdone, problematic feature rules (e.g. orthography for longs with all irregularities, but it is not possible to render a single longs between spaces)
    From your post it's fairly clear that you're primarily interested in fonts intended for mediaevalist usage rather than simply historical revivals in general.

    I think, though, it is important to distinguish between 'questionable techniques' and techniques that simply don't conform to your preferences.

    Just to comment on the longs issue, many fonts contain a 'hist' feature which simply converts all s's to longs (or to s.hist) which means that you essentially need to apply this feature on a letter-by-letter basis.

    I personally prefer a hist feature which contains "sub s' @lowercase by longs". This is far from perfect, but it is easier to apply this to a block of text and turn the feature off in a few cases than to activate the feature on a character-by-character basis.

    This would mean that a single longs could not be created between two spaces by applying the 'hist' feature to 's', but it can still be inserted using Unicode.

    I don't see anything 'questionable' about this technique since the OpenType standard is really not clear on how to handle this. It is simply a matter of personal preference regarding what is the most convenient way for entering text with longs. [I should note I make fonts primarily for my own use].

    MUFI could adopt specific recommendations on how to handle these cases, but that would define what it means to be 'MUFI-compliant', which isn't necessarily the same thing as 'good technique'.
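
The effect of the contextual rule quoted in this comment, "sub s' @lowercase by longs", can be simulated on plain text. The following Python sketch is only a model: it assumes @lowercase is simply a-z, and `apply_hist` is a hypothetical name. It replaces an s only when a lowercase letter follows.

```python
import re

def apply_hist(text: str) -> str:
    """Replace s by long s when the next character is a lowercase letter."""
    # Assumption: @lowercase is modelled as the ASCII letters a-z.
    return re.sub(r"s(?=[a-z])", "ſ", text)

print(apply_hist("Shakespeare"))
print(apply_hist("a s b"))
```

In this model a word-final s and an isolated s between spaces are left round, which is the limitation discussed above.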

  • Just to comment on the longs issue, many fonts contain a 'hist' feature which simply converts all s's to longs (or to s.hist) which means that you essentially need to apply this feature on a letter-by-letter basis.

    I personally prefer a hist feature which contains "sub s' @lowercase by longs". This is far from perfect, but it is easier to apply this to a block of text and turn the feature off in a few cases than to activate the feature on a character-by-character basis.
    That's not an option and no popular or professional font does this. A long s is a long s.

    If users can't read a long s, how should they read Fraktur or Schwabacher? 

    For German it's not possible to turn a small s into a long s by a feature. The rules are complex. At the end of a syllable or word there is the normal "round" s, at the beginning the long s. Thus Arbeitsamt is correct because of Arbeits-amt, but Baumwollſamt (cotton velvet) because of Baumwoll-ſamt.
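
The two examples show why the font alone cannot decide: the boundary has to come from outside. As a deliberately simplified sketch in Python, assume the boundaries are marked by hand with a hyphen; the function `long_s` is hypothetical and ignores the further irregularities of the real rules. It keeps a round s at the end of each part and turns every other s into a long s.

```python
def long_s(word_with_boundaries: str, boundary: str = "-") -> str:
    """Apply a simplified long s rule to a word with marked boundaries."""
    converted = []
    for part in word_with_boundaries.split(boundary):
        # The last letter of a part is kept as is, so a final s stays round.
        body, last = part[:-1], part[-1:]
        converted.append(body.replace("s", "ſ") + last)
    return "".join(converted)

print(long_s("Arbeits-amt"))
print(long_s("Baumwoll-samt"))
```

Without the marked boundary the two words cannot be told apart by a rule that only looks at neighbouring letters.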

    My focus is not mediaeval but rather the 18th and 19th centuries. I just also want to provide MUFI code points for precomposed glyphs where no Unicode code point exists. E.g. for a + combining e, MUFI has a code point in the PUA. The font can make the glyph available in both ways. The same goes for ligatures like longs_s, c_k, c_h etc.

    Mediaeval is a lot more complicated. That's what fonts like Andron by Andreas Stötzner support.
  • Helmut Wollmersdorfer said:

    A long s is a long s.
    I'm a bit unclear on what you mean by the above. Are you suggesting that longs should always be entered as U+017F rather than as U+0073 and that the latter should be reserved for round s only?

    That seems to defeat the purpose of having a 'hist' feature, which is to allow a word like 'Shakeſpeare' to be encoded as underlying 'Shakespeare' for purposes of searching etc.

    I fully agree that it is not possible to create a hist feature which can be applied to a block of German text to convert lowercase s to longs, but German is not the only language which makes use of longs. For various periods of English contextual replacement works reasonably well.

    And if a font geared towards English users makes use of such a strategy, it isn't going to prevent someone from either (a) manually entering longs as U+017F or (b) deriving longs from s by applying the 'hist' feature on a character-by-character basis rather than by applying it to a paragraph as a whole, which would seem to be the two options available for German.