Glyph Collector


Comments

  • Claudio Piccinini Posts: 305
    edited September 2019
    It looks wonderful! I can’t wait!
  • Hrant H. Papazian Posts: 1,570
    edited September 2019
    Now, who's going to have the guts to revive Cheltenham's ascending "r" above?
  • Now, who's going to have the guts to revive Cheltenham's ascending "r" above?
    Me, me! I love them. :)
  • Is anyone else experiencing difficulty downloading GlyphCollector from the links provided? The file seems to not be there. Is it possible to re-host it?
  • Vasil Stanev Posts: 480
    edited September 2019
    Why should I scan individual glyphs and distribute them to folders like it's 1997? An inbuilt OCR should be able to parse whole pages and distribute hits automatically, saving me the work.

    Half a day for all the a's on a page? Ain't nobody got time for that. Least of all institutions that will have to train operators for their digitised documents.
  • Why should I scan individual glyphs and distribute them to folders like it's 1997? An inbuilt OCR should be able to parse whole pages and distribute hits automatically, saving me the work.
    It probably depends on how much accuracy you want. Maybe both options could be provided. But he still has to release it. Isn’t it premature to comment?
  • edited September 2019
    IIRC, some time ago Gábor started to investigate OCR for automatically selecting ‘base’ glyphs, but encountered some technical problems. Of course, GC can also be used for collecting floriated capitals, ornaments, or whatever. Selecting ‘base’ glyphs in, for example, Preview is not really much work, especially not compared with what GC will automate then, I reckon.
  • Seems like the version available on the repo is still four years old. I am curious to see the new one!
  • Seems like the version available on the repo is still four years old. I am curious to see the new one!
    Me too!
  • Hi Thomas, hi Claudio, the latest edition of GlyphCollector for macOS and Windows is available from this GitHub page.
  • Hi Thomas, hi Claudio, the latest edition of GlyphCollector for macOS and Windows is available from this GitHub page.
    Hi Frank. Thanks, but on my Mac it quits as soon as it’s launched. :-( 
    (I am using a late 2009 tower with El Capitan 10.11.6, the last one I could install.)
  • Hi Claudio, I will have to check this to be sure, but macOS 10.12 (Sierra) is probably the oldest system that supports the new edition of GC, I reckon.
  • Does it need incoming network connections for some reason?
  • Since it's apparently written with electron, can we get a Linux build? Should be trivial… once the source code is out I could do it myself.
  • Hi Claudio, I will have to check this to be sure, but macOS 10.12 (Sierra) is probably the oldest system that supports the new edition of GC, I reckon.
    Dang. :-( 
  • Since it's apparently written with electron, can we get a Linux build? Should be trivial… once the source code is out I could do it myself.
    Agreed, would love to see source :)  https://github.com/krksgbr/glyphcollector/issues/2
  • Hi all!

    Sorry for the long wait.
    I've published the sources and I'll make a public release soon.

    https://github.com/krksgbr/glyphcollector
  • IIRC, some time ago Gábor started to investigate OCR for automatically selecting ‘base’ glyphs, but encountered some technical problems. Of course, GC can also be used for collecting floriated capitals, ornaments, or whatever. Selecting ‘base’ glyphs in, for example, Preview is not really much work, especially not compared with what GC will automate then, I reckon.
    Just detected this discussion via Youtube and Google.

    OCR meets Glyphdrawers.

    I'm coming from OCR of historical books. It's a difficult science with different expert fields like automatic image correction (dewarping, local binarization), layout recognition, segmentation, character (and font) recognition, language processing, and deep learning.

    Most of the large digitisation projects collect glyphs and calculate a prototype to improve the results of OCR. This cuts the error rates and manual correction costs roughly in half. The idea isn't new and was developed independently, or by reading scientific papers in the field. I also developed my own algorithms for this.

    One of the programs, dated 2011 in its latest version, was developed by a student for Project Runeberg in Sweden. It assumes only one font in the document. And it needs training like GlyphCollector, i.e. manual work.

    What are the problems using OCR for (automatic) font compilation?

    Let me explain with a screenshot from a diagnostic page of my automatic workflow:


    The above table shows the output of OCR, or rather, uses the result and measures from OCR to cut out the small images from the page. It's from one page, printed ~1830, containing ~1500 characters, 85 different characters including symbols and punctuation, in 3 different fonts (Old Schwabacher, Breitkopf Fraktur, Caslon-like Antiqua).

    The second table groups (clusters) what OCR recognized as a lowercase c. Clustering allows better recognition of the right character and of the font, even if the program doesn't know which font it is.

    In the second table, the second line is Antiqua, maybe containing some broken e. The third line is blackletter e (the lowercase letters of Fraktur and Schwabacher are very similar). The 4th (Fraktur) and 5th (Schwabacher) lines contain c as part of a ligature (ch, ck) split by OCR. For the calculation of prototypes (an "average" glyph) we must ignore outliers: broken glyphs, too much ink, small details filled with ink and mud, speckles. Look at the numbers in the first table; these are pixels. Rounding caused by image processing can make a difference of 15% at these small text or corpus sizes. Quality is limited, as most books are scanned at 300 or 600 dpi, seldom at 1200 or 2400.
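    The prototype calculation sketched above (average the cluster, discard outliers) can be illustrated in a few lines of Python. This is a hypothetical sketch with NumPy, using my own function names and thresholds, not the code of any system mentioned here:

```python
import numpy as np

def glyph_prototype(glyphs, z_thresh=2.0):
    """Average a stack of aligned, binarized glyph images into a prototype,
    discarding outliers (broken glyphs, glyphs filled with ink, speckles).

    `glyphs` has shape (n, height, width) with values in [0, 1].
    """
    stack = np.asarray(glyphs, dtype=float)
    mean = stack.mean(axis=0)
    # Euclidean distance of each glyph image from the provisional mean.
    dists = np.sqrt(((stack - mean) ** 2).sum(axis=(1, 2)))
    # Keep glyphs within z_thresh standard deviations of the average distance.
    keep = np.abs(dists - dists.mean()) <= z_thresh * (dists.std() + 1e-9)
    return stack[keep].mean(axis=0), keep

# Tiny demo: nine noisy copies of a diagonal "glyph" plus one glyph
# completely filled with ink.
rng = np.random.default_rng(0)
clean = np.clip(np.eye(8) + rng.normal(0.0, 0.05, (9, 8, 8)), 0.0, 1.0)
outlier = np.ones((1, 8, 8))
proto, kept = glyph_prototype(np.concatenate([clean, outlier]))
print(kept)  # the filled glyph should be rejected
```

    In practice the glyphs would first have to be aligned (registered) to a common baseline and size before averaging; that step is omitted here.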

    It's also a problem how often a character appears. In most Western languages, the 5-6 most frequent letters (e, r, n, s, t, i) make up 50% of the text. Some letters are very rare and may be missing from the whole book.

    Title pages are special. In old books they used 5-7 different fonts, differently flourished, in different sizes, narrow and extra narrow. You will not have all glyphs in a line with 20 characters (two capitals). So you need many title pages (100+) with the same typeface to collect all capitals, and will still miss the X or Q. But the large sizes used on title pages will be of better quality, more usable for automatic vectorisation. The recognition rates of OCR on title pages are extremely poor. So for me, collecting glyphs from title pages during OCR is essential, because I can get near 100% accuracy with my method.

    Storing the cutouts of the characters along with some metadata (book, publisher, year, city) doesn't need much storage. And maybe I can compile a font automatically (I'm still learning the guts of font files) which can be a base for fontdrawers to improve them. Or, if Gábor provides a clean interface (file level?), provide everything needed as input for Glyph Collector to minimise the boring manual work.
  • @Helmut Wollmersdorfer

    Thanks for comment, it's an interesting read!

    Could you please elaborate on what you mean by:

    And maybe I can compile a font automatically (I'm still learning the guts of font files) which can be a base for fontdrawers to improve them. Or, if Gábor provides a clean interface (file level?), provide everything needed as input for Glyph Collector to minimise the boring manual work.
    What goes in the font files? Is it the vectorized prototypes (averages)?
    What could I provide a clean interface for?

    As an aside, my current stance on OCR is this:
    I thought about using OCR in GlyphCollector but decided to keep the manual method instead, for a few reasons: it's significantly simpler than OCR, it's effective, and it's language-agnostic. GlyphCollector doesn't know about characters; it just provides a convenient user interface for running template matching on images. As Frank has mentioned before, providing samples is not that much work compared to what GC automates away. I would like to keep GlyphCollector as simple and focused as possible.
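    As a side note for readers curious what "template matching on images" means in practice, here is a minimal normalized cross-correlation sketch in Python. This is a hypothetical illustration (my own function names and threshold), not GlyphCollector's actual implementation:

```python
import numpy as np

def match_template(page, template, threshold=0.9):
    """Slide `template` over `page` and return the (row, col) positions
    where the normalized cross-correlation exceeds `threshold`."""
    th, tw = template.shape
    t = template - template.mean()
    tnorm = np.sqrt((t ** 2).sum())
    hits = []
    for r in range(page.shape[0] - th + 1):
        for c in range(page.shape[1] - tw + 1):
            win = page[r:r + th, c:c + tw]
            w = win - win.mean()
            wnorm = np.sqrt((w ** 2).sum())
            if wnorm == 0 or tnorm == 0:
                continue  # flat window (blank paper): no correlation defined
            score = (w * t).sum() / (wnorm * tnorm)
            if score >= threshold:
                hits.append((r, c))
    return hits

# Tiny demo: find a 2x2 "glyph" stamped twice on a blank page.
page = np.zeros((10, 10))
glyph = np.array([[1.0, 0.0], [0.0, 1.0]])
page[1:3, 1:3] = glyph
page[6:8, 4:6] = glyph
print(match_template(page, glyph))  # [(1, 1), (6, 4)]
```

    Real tools compute this with FFT-based correlation rather than nested loops, but the score is the same idea: 1.0 for a perfect match, lower for partial overlaps.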
  • @Gábor Kerekes
    Vasil Stanev criticises the manual sampling in his comment: https://typedrawers.com/discussion/comment/43747/#Comment_43747. Also, OCR systems have smart image preprocessing that repairs degraded images (dewarping, despeckling, binarisation).

    For me the sampling is a by-product of OCR post-processing, as I collect the glyphs and compare them to improve character recognition and font recognition. The core does the same: compare images. The result is a collection of all glyph images along with the page images.

    Sampling is not much work if a clean type specimen listing all glyphs, or sample pages in an identified font, are available. If there are only books, it takes many pages to find e.g. a capital letter X or Y. The probability of the letter X (or Y) in German is 0.00001, or 1:100,000. That means somewhere on the order of 50 pages with 2000 characters on each page to find an X; maybe it needs all 500 pages.
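    The arithmetic behind that estimate, as a quick sanity check (round figures taken from the paragraph above):

```python
import math

# Round figures from the paragraph above:
p = 1e-5               # probability of the letter X per character in German
chars_per_page = 2000  # characters per page

expected_pages = 1 / (p * chars_per_page)        # pages per expected X
p_none_in_50 = (1 - p) ** (50 * chars_per_page)  # chance of no X in 50 pages

print(expected_pages)          # 50.0
print(round(p_none_in_50, 2))  # ~0.37: even after 50 pages, a real chance of no X
```

    So 50 pages is only the expected wait; since the misses are roughly Poisson-distributed, about a third of 50-page samples will still contain no X at all.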

