Empirical Statistics of Font/Glyph Metrics

It would be nice to get some additional ideas on which font metrics could be interesting before I rework my existing programs.

First, some words about context and intention. My focus is the automatic digital reconstruction of old books, mainly on natural history, from the 17th and 18th centuries, in German, English, and Latin. This includes OCR and image refinement and, for several reasons, also the automatic reconstruction of fonts. First, it helps scientists to have a digital reconstruction of the original and to switch to a modern font with one click. Second, it helps to improve the OCR recognition rate, which is a chicken-and-egg problem: with nearly original fonts the training data can be generated automatically. At the moment the training data comes from manual transcription, which is error-prone. The results are not bad: the error rate of my models is 0.3%, whereas error rates for average texts from ~1800 are usually somewhere in the range of 4-7%.

At the moment I use (not in production) font metrics along with image similarity for font identification and glyph clustering.

There are 3 ways to get the data:

1) interpreting the digital font directly
2) rendering the glyphs and measuring them using image processing
3) taking the measures from a real sample (scan or photo of a page)

For 2) and 3) I can use the same program.

What I measure at the moment:

- top, left, bottom, right (in relation to the baseline)
- descender, x-line, ascender (h-line), H-line
- aspect (height/width)
- density (black/white pixels)
- font size 
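
To make the list above concrete, here is a minimal Python sketch of how such measures could be taken from a binarized glyph image. It is not the existing program. The function name `glyph_metrics`, the baseline convention (the baseline row is the last ink row of a non-descending letter), and the reading of "density" as the ink fraction inside the bounding box are assumptions made for illustration.

```python
def glyph_metrics(bitmap, baseline_row):
    """Measure a binarized glyph.

    bitmap: list of rows, 1 = ink, 0 = background (assumed input format).
    baseline_row: index of the row that sits on the baseline (assumption).
    """
    ink = [(r, c) for r, row in enumerate(bitmap)
           for c, v in enumerate(row) if v]
    if not ink:
        return None  # empty image, nothing to measure

    rows = [r for r, _ in ink]
    cols = [c for _, c in ink]
    top_row, bottom_row = min(rows), max(rows)
    left, right = min(cols), max(cols)
    width = right - left + 1
    height = bottom_row - top_row + 1

    return {
        "top": baseline_row - top_row + 1,    # ink rows from baseline upwards
        "bottom": baseline_row - bottom_row,  # 0, or negative for descenders
        "left": left,
        "right": right,
        "width": width,
        "height": height,
        "aspect": height / width,
        # Assumed definition: share of ink pixels inside the bounding box.
        "density": len(ink) / (width * height),
    }


# A crude "p"-like glyph: bowl on the baseline (row 3), stem descending below.
P_GLYPH = [
    [0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 1, 0, 1, 0],
    [0, 1, 1, 1, 0],
    [0, 1, 0, 0, 0],
    [0, 1, 0, 0, 0],
]
metrics = glyph_metrics(P_GLYPH, baseline_row=3)
```

With this convention `height` always equals `top - bottom`, so the descender depth can be read directly from `bottom`.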

For reconstruction additional metrics are needed:

- spacing
- kerning if there is an overlap (in metal type it's negative spacing)
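
Spacing and overlap could be collected per letter pair from the horizontal ink extents of neighbouring glyphs. The following Python sketch assumes each line is available as a list of hypothetical `(char, left, right)` tuples in reading order, with `right` inclusive. A negative gap then means the bounding boxes overlap, i.e. a kerned pair.

```python
from collections import defaultdict
from statistics import median


def pair_gaps(line):
    """Collect horizontal gaps between neighbouring glyphs of one line.

    line: list of (char, left, right) tuples, right inclusive (assumed format).
    Returns {pair: [gap, ...]}; a negative gap means the boxes overlap.
    """
    gaps = defaultdict(list)
    for (a, _, a_right), (b, b_left, _) in zip(line, line[1:]):
        gaps[a + b].append(b_left - (a_right + 1))
    return gaps


def median_pair_gaps(lines):
    """Median gap per letter pair over many lines."""
    merged = defaultdict(list)
    for line in lines:
        for pair, values in pair_gaps(line).items():
            merged[pair].extend(values)
    return {pair: median(values) for pair, values in merged.items()}


# Two made-up lines; "f" overhangs "i" in the first one.
sample_lines = [
    [("f", 0, 9), ("i", 8, 11), ("n", 14, 22)],
    [("f", 0, 9), ("i", 10, 13)],
]
pair_stats = median_pair_gaps(sample_lines)
```

The median is used here only because it is robust against a few badly segmented glyphs; any other summary statistic would fit the same structure.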

What would also be possible to measure:

- stroke-widths
- slant
- distance of diacritics
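
For stroke widths, one simple possibility is to look at the lengths of horizontal ink runs: in upright letters most runs cross a vertical stem, so their median gives a rough stem width. This is only a sketch under that assumption (it ignores slanted and horizontal strokes), and the function names are made up for illustration.

```python
from statistics import median


def horizontal_runs(bitmap):
    """Lengths of all horizontal runs of ink (1 = ink, 0 = background)."""
    runs = []
    for row in bitmap:
        length = 0
        for value in row:
            if value:
                length += 1
            elif length:
                runs.append(length)
                length = 0
        if length:  # run that touches the right edge of the image
            runs.append(length)
    return runs


def stem_width(bitmap):
    """Rough vertical stem width: the median horizontal run length."""
    runs = horizontal_runs(bitmap)
    return median(runs) if runs else None


# A crude "H": two stems of width 2 joined by a crossbar.
H_GLYPH = [
    [1, 1, 0, 0, 0, 1, 1],
    [1, 1, 0, 0, 0, 1, 1],
    [1, 1, 1, 1, 1, 1, 1],
    [1, 1, 0, 0, 0, 1, 1],
    [1, 1, 0, 0, 0, 1, 1],
]
h_stem = stem_width(H_GLYPH)
```

Running the same function on the transposed bitmap would give a comparable estimate for horizontal strokes, and the ratio of the two would be one way to express stroke contrast.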

What else would be interesting? Sometimes information takes on a new quality when it is available across fonts. For example, vertical proportions at reading sizes do not differ much across fonts; the same holds for aspect.

Comments

  • Nick Shinn
    The biggest problem I’ve faced with creating digital font facsimiles of old pages is justification. Do you have a method which fixes the number of characters per line?
  • "The biggest problem I’ve faced with creating digital font facsimiles of old pages is justification. Do you have a method which fixes the number of characters per line?"
    Do you mean that the characters of a line should be in the same place as in the original? This is only possible if a font has the same width and spacing as the original one.

    For blackletter ("black" is another word for bold) it's obvious that the same characters set in a modern regular "Swiss" face add up to a shorter line, or leave ugly wide spaces between the words.

    The same problem appears when a searchable PDF is generated: the facsimile as an image with an overlay of text in an "invisible" font, so that the text can be selected with the mouse. Some people then complain that the length of a word is not the same as in the underlying image. In a PDF, single words can be positioned in the coordinate system, so the beginning of each word is correct. But different fonts have different run lengths.

    I found only this example where it is nicely visible: in the second line, the first word (right to left) is shorter and the second is longer.
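
One way to reduce the run-length mismatch in such an overlay is to stretch or squeeze each word horizontally so that it covers the word in the image. The Python sketch below only computes the percentage. It assumes glyph advances in font units (1000 per em here) and a measured target width in points; all names and numbers are hypothetical. In a PDF content stream the result could be applied with the horizontal scaling operator `Tz`, which takes a percentage.

```python
def natural_width(word, advances, font_size, units_per_em=1000):
    """Width of the word in the overlay font, without kerning (assumption)."""
    return sum(advances[ch] for ch in word) * font_size / units_per_em


def horizontal_scale_percent(word, advances, font_size, target_width,
                             units_per_em=1000):
    """Percentage by which the word must be scaled to match target_width."""
    width = natural_width(word, advances, font_size, units_per_em)
    if width <= 0:
        raise ValueError("word has no width in the overlay font")
    return 100.0 * target_width / width


# Hypothetical advances of the overlay font, in font units.
overlay_advances = {"T": 600, "e": 500, "x": 500, "t": 300}

# The word is 19 pt wide at 10 pt, but 22.8 pt wide in the scan.
scale = horizontal_scale_percent("Text", overlay_advances, 10, 22.8)
```

Scaling per word keeps both the beginning and the end of each word aligned with the image, at the cost of slightly distorted (but invisible) glyphs.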


  • Nick Shinn
    “This is only possible, if a font has the same width and spacing as the original one.”

    No, that’s not the issue I experienced; it was the justification algorithms in layout apps such as Quark and InDesign that caused the difficulty. Even with the same glyph metrics as foundry type, I found it impossible to duplicate hand-set spacing. Often, the word spaces in the originals were far, far larger than anything the software permitted in any of its automatic settings.

    I haven’t done any systematic research on this, or investigated Linotype; I’m just relating a problem I faced when making “restoration” revivals, fine-tuning them by attempting to create facsimile settings of pages which contained the original, on which I based my digital versions.
  • John Hudson
    edited January 2023
    In case a difference in technical terminology between languages is at issue here: justification in the sense to which Nick refers is margin alignment on both left and right, so typically involving adjustments to the width of wordspaces in individual lines. Techniques for justifying text vary across time, location, and technology, and can be difficult to replicate algorithmically.
  • Nick Shinn
    Because of the difficulty in duplicating line breaks with layout application “paragraph composers”, I would recommend that the software be set up to “fix” the first and last words in the line, then space the words in between evenly.
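
A minimal Python sketch of this recommendation, assuming the word widths in the reconstructed font and the left and right ends of the line are already known: the first word starts at the left end, the last word ends at the right end, and the remaining space is divided evenly between the words. The function name and the numbers are made up for illustration.

```python
def justify_positions(word_widths, line_start, line_end):
    """Return the x position of each word.

    The first word is pinned to line_start, the last word ends at line_end,
    and the free space is shared equally by the gaps in between.
    """
    count = len(word_widths)
    if count == 0:
        return []
    if count == 1:
        return [line_start]  # a single word cannot be justified

    free = (line_end - line_start) - sum(word_widths)
    gap = free / (count - 1)  # may be negative if the words do not fit

    positions = []
    x = line_start
    for width in word_widths:
        positions.append(x)
        x += width + gap
    return positions


# Three words of width 30, 20 and 40 on a line from 0 to 120.
positions = justify_positions([30, 20, 40], 0, 120)
```

Because line breaks are taken from the original, a layout application's paragraph composer is not involved; only the word gap varies from line to line.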
  • In electronic typesetting the spaces between words within one line should all have the same width. The same holds for text set on linecasting machines like the Linotype, which used wedge-shaped spacebands to justify the line.

    In manual metal typesetting (everything before ~1900) the usual space was 0.25 em, in Unicode U+2005 FOUR-PER-EM SPACE. At the end of the line the typesetter divided the remaining space and inserted spacing material between the words. The result was sometimes ugly.

    Look at the last two lines of the sample: there is space between word and comma, and the spaces between the words differ.



    But I do not see the space between words as a metric of a font or glyph. For spacing, it only makes sense to measure letterspacing.