Automatic Font Repair (Proof of Concept)
Helmut Wollmersdorfer
The Problem
There exist many historical fonts crafted by amateurs using questionable techniques like:
- misusing code points (e.g. $ for longs)
- precomposed ligature glyphs with code points in the PUA (not following the MUFI standard)
- names not always following best practices
- no feature rules
- overdone, problematic feature rules (e.g. orthography for longs with all irregularities, but it is not possible to render a single longs between spaces)
- not supporting base character + combining mark as the preferred Unicode encoding where Unicode has no precomposed code point (e.g. [AOUaou] + combining e above)
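A first check for the code point problems above could be to list every mapping that sits in a Private Use Area, so it can be compared against MUFI by hand. This is only a sketch: the `sample_cmap` dictionary and its glyph names are made up for illustration, and the U+E000 assignment is not a MUFI value.

```python
# The three Private Use ranges defined by Unicode.
PUA_RANGES = ((0xE000, 0xF8FF), (0xF0000, 0xFFFFD), (0x100000, 0x10FFFD))


def is_private_use(code_point):
    """True if the code point lies in one of the Private Use Areas."""
    return any(lo <= code_point <= hi for lo, hi in PUA_RANGES)


def private_use_entries(cmap):
    """Return the PUA part of a {code point: glyph name} mapping."""
    return {cp: name for cp, name in sorted(cmap.items()) if is_private_use(cp)}


# Hypothetical mapping as it might be read from a font's cmap table.
sample_cmap = {0x24: "dollar", 0x61: "a", 0xE000: "c_h"}
suspicious = private_use_entries(sample_cmap)
```

Misused regular code points such as $ for longs cannot be found this way; they need the optical identification step.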
Concept
- normalise code points (for PUA use MUFI)
- normalise names of characters and variants
- support ZWJ and ZWNJ for explicit control of ligatures, parallel to hlig
- similar for fractions
- remove orthographic rules, i.e. replace all ligature and substitution rules with a basic standardised rule set
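One way a basic standardised rule set could be produced is to derive it from the normalised glyph names. The sketch below assumes that ligature glyphs are named by joining their components with underscores (e.g. `longs_s`, `c_k`) and emits feature file text; the function name and the default feature tag are my own choices, not part of any library.

```python
def build_ligature_feature(glyph_names, tag="liga"):
    """Return fea source with one ligature rule per underscore-joined glyph name."""
    available = set(glyph_names)
    rules = []
    for name in sorted(available):
        # Skip plain glyphs and suffixed variants such as "c_h.alt".
        if "_" not in name or "." in name:
            continue
        parts = name.split("_")
        # Only emit a rule if every component glyph exists in the font.
        if all(part in available for part in parts):
            rules.append((parts, name))
    # Longest ligatures first, then alphabetical.
    rules.sort(key=lambda rule: (-len(rule[0]), rule[1]))
    lines = [f"feature {tag} {{"]
    lines += [f"    sub {' '.join(parts)} by {name};" for parts, name in rules]
    lines.append(f"}} {tag};")
    return "\n".join(lines)


glyphs = ["c", "h", "k", "longs", "s", "c_h", "c_k", "longs_s", "longs_t"]
fea_text = build_ligature_feature(glyphs)
```

Here `longs_t` is skipped because the sample glyph list has no `t`. ZWJ/ZWNJ handling and fractions are not covered by this sketch.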
Purpose
The main purpose is the training of OCR systems. Perfect quality doesn't matter, as the training files are usually created by rendering sample images from hundreds of fonts for a language/period, and the images are also artificially degraded in several variations. Font identification can also be trained to some degree. Reconstructed fonts are always digitised from a specific optical size, e.g. 12 or 16 pt, and in Fraktur these sizes differ: the larger the size, the more swashed the capital letters. Often they are reconstructed from a reconstructed cut (late 19th century or Linotype customisation).
Implementation
My workflow is mostly developed in Perl, which has poor or no support for reading and manipulating OTF files. But it's easy to work with XML (ttx). It's also easy to render specimens of characters and identify them optically, with relatively high accuracy, for unidentified glyphs in a font. The remaining ones can be identified manually.
Planned steps are roughly:
- craft a standard set of rules in fea syntax for historical German
For each font:
- convert to ttx
- normalise code points and glyph names
- render and identify remaining glyphs
- report what's missing for manual solving or accept it
- remove feature rules and apply the standard set
- compile the repaired font
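The code point normalisation step could look roughly like this on a ttx dump. The sketch uses only the Python standard library and assumes the usual ttx layout of the cmap table (`cmap_format_*` subtables containing `map` elements with `code` and `name` attributes); the sample XML and the function name are illustrative.

```python
import xml.etree.ElementTree as ET

# Tiny stand-in for a ttx dump; a real file has many more tables.
SAMPLE_TTX = """<ttFont>
  <cmap>
    <tableVersion version="0"/>
    <cmap_format_4 platformID="3" platEncID="1" language="0">
      <map code="0x24" name="dollar"/>
      <map code="0x61" name="a"/>
    </cmap_format_4>
  </cmap>
</ttFont>"""


def remap_code_points(ttx_root, remap):
    """Rewrite cmap code points according to {old: new}; return the change count."""
    changed = 0
    for entry in ttx_root.iterfind("./cmap/*/map"):
        old = int(entry.get("code"), 16)
        if old in remap:
            entry.set("code", hex(remap[old]))
            changed += 1
    return changed


root = ET.fromstring(SAMPLE_TTX)
# Move the glyph that was misplaced on "$" to U+017F LATIN SMALL LETTER LONG S.
count = remap_code_points(root, {0x24: 0x17F})
codes = [m.get("code") for m in root.iterfind("./cmap/*/map")]
```

Renaming the glyph itself (here `dollar` to `longs`) touches several other tables and is not shown. The sketch also does not check whether the target code point is already occupied in the same subtable.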
This doesn't mean that it is an easy development task done in one day. Some problems with the Python font libraries fonttools and fontFeatures by @Simon Cozens can be expected, as they may not support some corner cases of this use case.
Comments
-
Ray Larabie said: For fonts missing combining diacriticals, you could generate them and calculate correct placement based on the offsets where they're placed above or below letters.
-
Helmut Wollmersdorfer said:
The Problem
There exist many historical fonts crafted by amateurs using questionable techniques like:
- overdone, problematic feature rules (e.g. orthography for longs with all irregularities, but it is not possible to render a single longs between spaces)
I think, though, it is important to distinguish between 'questionable techniques' and techniques that simply don't conform to your preferences.
Just to comment on the longs issue, many fonts contain a 'hist' feature which simply converts all s's to longs (or to s.hist) which means that you essentially need to apply this feature on a letter-by-letter basis.
I personally prefer a hist feature which contains "sub s' @lowercase by longs". This is far from perfect, but it is easier to deal with applying this to a block of text and turning the feature off in a few cases than activating the feature on a character by character basis.
This would mean that a single longs could not be created between two spaces by applying the 'hist' feature to 's', but it can still be inserted using Unicode.
I don't see anything 'questionable' about this technique since the OpenType standard is really not clear on how to handle this. It is simply a matter of personal preference regarding what is the most convenient way for entering text with longs. [I should note I make fonts primarily for my own use].
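To illustrate what such a contextual rule does to running text, here is a small simulation in Python rather than in a font. It assumes that @lowercase contains just a to z, which is a simplification; in a real font the class is whatever the designer puts into it.

```python
import re

LONG_S = "\u017f"  # U+017F LATIN SMALL LETTER LONG S


def apply_contextual_hist(text):
    """Replace every s that is directly followed by a lowercase letter."""
    return re.sub(r"s(?=[a-z])", LONG_S, text)


converted = apply_contextual_hist("Shakespeare")
isolated = apply_contextual_hist("a s b")
```

The second call shows the limitation mentioned above: an s between two spaces has no lowercase letter after it, so the rule leaves it alone.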
MUFI could adopt specific recommendations on how to handle these cases, but that would define what it means to be 'MUFI-compliant', which isn't necessarily the same thing as 'good technique'.
-
André G. Isaak said:
Just to comment on the longs issue, many fonts contain a 'hist' feature which simply converts all s's to longs (or to s.hist) which means that you essentially need to apply this feature on a letter-by-letter basis.
I personally prefer a hist feature which contains "sub s' @lowercase by longs". This is far from perfect, but it is easier to deal with applying this to a block of text and turning the feature off in a few cases than activating the feature on a character by character basis.
If users can't read a long s, how should they read Fraktur or Schwabacher?
For German it's not possible to turn a small s into a long s by a feature. The rules are complex. At the end of a syllable or word there is the normal "round" s, at the beginning the long s. Thus Arbeitsamt is correct with a round s, because it divides as Arbeits-amt, but it is Baumwollſamt (cotton velvet) with a long s, because it divides as Baumwoll-ſamt.
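A sketch of why this needs more than the font: if the boundaries are marked in the input, the simplified rule above becomes mechanical. The code below assumes boundaries are given as hyphens and implements only that one rule (round s at the end of a segment, long s elsewhere); the real orthography has more cases, and finding the boundaries requires a dictionary or manual marking.

```python
LONG_S = "\u017f"  # U+017F LATIN SMALL LETTER LONG S


def long_s_by_segments(marked_word, boundary="-"):
    """Apply the simplified rule to a word whose segments are marked."""
    result = []
    for segment in marked_word.split(boundary):
        chars = list(segment)
        for i, ch in enumerate(chars):
            # Keep the round s only in segment-final position.
            if ch == "s" and i < len(chars) - 1:
                chars[i] = LONG_S
        result.append("".join(chars))
    return "".join(result)


office = long_s_by_segments("Arbeits-amt")
velvet = long_s_by_segments("Baumwoll-samt")
```

Without the hyphen both words look alike to a contextual rule: the s is followed by "amt" in each case.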
My focus is not mediaeval, more 18th and 19th century. I just want to also provide MUFI code points for precomposed glyphs where no Unicode code point exists. E.g. for a + combining e, MUFI has a code point in the PUA. The font can make the glyph available in both ways. Same for ligatures like longs_s, c_k, c_h etc.
Mediaeval is a lot more complicated. That's what fonts like Andron by Andreas Stötzner support.
-
That seems to defeat the purpose of having a 'hist' feature, which is to allow a word like 'Shakeſpeare' to be encoded as underlying 'Shakespeare' for purposes of searching etc.
I fully agree that it is not possible to create a hist feature which can be applied to a block of German text to convert lowercase s to longs, but German is not the only language which makes use of longs. For various periods of English contextual replacement works reasonably well.
And if a font geared towards English users makes use of such a strategy, it isn't going to prevent someone from either (a) manually entering longs as U+017F or (b) deriving longs from s by applying the 'hist' feature on a character by character basis rather than by applying it to a paragraph as a whole, which would seem to be the two options available for German.