Glyph Collector


Comments

  • Claudio Piccinini Posts: 305
    edited September 2019
    It looks wonderful! I can’t wait!
  • Hrant H. Papazian Posts: 1,570
    edited September 2019
    Now, who's going to have the guts to revive Cheltenham's ascending "r" above?
  • Now, who's going to have the guts to revive Cheltenham's ascending "r" above?
    Me, me! I love them. :)
  • Is anyone else experiencing difficulty downloading GlyphCollector from the links provided? The file seems to not be there. Is it possible to re-host it?
  • Vasil Stanev Posts: 480
    edited September 2019
    Why should I scan individual glyphs and distribute them to folders like it's 1997? An inbuilt OCR should be able to parse whole pages and distribute hits automatically, saving me the work.

    Half a day for all the a's on a page? Ain't nobody got time for that. Least of all institutions that will have to train operators for their digitised documents.
  • Why should I scan individual glyphs and distribute them to folders like it's 1997? An inbuilt OCR should be able to parse whole pages and distribute hits automatically, saving me the work.
    It probably depends on how much accuracy you want. Maybe both options could be provided. But he still has to release it. Isn’t it premature to comment?
  • edited September 2019
    IIRC, some time ago Gábor started to investigate OCR for automatically selecting ‘base’ glyphs, but encountered some technical problems. Of course, GC can also be used for collecting floriated capitals, ornaments, or whatever. Selecting ‘base’ glyphs in, for example, Preview is not really much work, especially not compared with what GC will automate then, I reckon.
  • Seems like the version available on the repo is still four years old. I am curious to see the new one!
  • Seems like the version available on the repo is still four years old. I am curious to see the new one!
    Me too!
  • Hi Thomas, hi Claudio, the latest edition of GlyphCollector for macOS and Windows is available from this GitHub page.
  • Hi Thomas, hi Claudio, the latest edition of GlyphCollector for macOS and Windows is available from this GitHub page.
    Hi Frank. Thanks, but on my Mac it quits as soon as it’s launched. :-( 
    (I am using a late 2009 tower with El Capitan 10.11.6, the last one I could install.)
  • Hi Claudio, I will have to check this to be sure, but macOS 10.12 (Sierra) is probably the oldest system that supports the new edition of GC, I reckon.
  • Does it need incoming network connections for some reason?
  • Since it's apparently written with electron, can we get a Linux build? Should be trivial… once the source code is out I could do it myself.
  • Hi Claudio, I will have to check this to be sure, but macOS 10.12 (Sierra) is probably the oldest system that supports the new edition of GC, I reckon.
    Dang. :-( 
  • Since it's apparently written with electron, can we get a Linux build? Should be trivial… once the source code is out I could do it myself.
    Agreed, would love to see source :)  https://github.com/krksgbr/glyphcollector/issues/2
  • Hi all!

    Sorry for the long wait.
    I've published the sources and I'll make a public release soon.

    https://github.com/krksgbr/glyphcollector
  • IIRC, some time ago Gábor started to investigate OCR for automatically selecting ‘base’ glyphs, but encountered some technical problems. Of course, GC can also be used for collecting floriated capitals, ornaments, or whatever. Selecting ‘base’ glyphs in, for example, Preview is not really much work, especially not compared with what GC will automate then, I reckon.
    Just detected this discussion via Youtube and Google.

    OCR meets Glyphdrawers.

    I'm coming from OCR of historical books. It's a difficult science with different expert fields like automatic image correction (dewarping, local binarization), layout recognition, segmentation, character (and font) recognition, language processing, and deep learning.

    Most of the large digitisation projects collect glyphs and calculate a prototype to improve the results of OCR. This cuts the error rates and manual correction costs roughly in half. The idea isn't new and was developed independently, or by reading scientific papers in the field. I also developed my own algorithms for this.

    One of the programs, dated 2011 in its latest version, was developed by a student for Project Runeberg in Sweden. It assumes only one font in the document. And it needs training like GlyphCollector, i.e. manual work.

    What are the problems using OCR for (automatic) font compilation?

    Let me explain with a screenshot from a diagnostic page of my automatic workflow:


    The above table shows the output of OCR, or rather, uses the result and measures from OCR to cut out the small images from the page. It's from one page, printed ~1830, containing ~1500 characters, 85 different characters including symbols and punctuation, in 3 different fonts (Old Schwabacher, Breitkopf Fraktur, Caslon-like Antiqua).

    The second table groups (clusters) what OCR recognized as a lowercase c. Clustering allows better recognition of the right character and of the font, even if the program doesn't know which font it is.

    In the second table, the second line is Antiqua, maybe containing some broken e. The third line is blackletter e (the lowercase letters of Fraktur and Schwabacher are very similar). The 4th (Fraktur) and 5th (Schwabacher) lines contain c as part of a ligature (ch, ck) split by OCR. For the calculation of prototypes (an "average" glyph) we must ignore outliers: broken glyphs, too much ink, small details filled with ink and mud, speckles. Look at the numbers in the first table; these are pixels. Rounding caused by image processing can make a difference of 15% at these small text or corpus sizes. Quality is limited, as most books are scanned at 300 or 600 dpi, seldom at 1200 or 2400.
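    The prototype calculation sketched above (average the cluster, discard outliers) can be illustrated in a few lines of Python. This is a hypothetical sketch with NumPy, using my own function names and thresholds, not the code of any system mentioned here:

```python
import numpy as np

def glyph_prototype(glyphs, z_thresh=2.0):
    """Average a stack of aligned, binarized glyph images into a prototype,
    discarding outliers (broken glyphs, glyphs filled with ink, speckles).

    `glyphs` has shape (n, height, width) with values in [0, 1].
    """
    stack = np.asarray(glyphs, dtype=float)
    mean = stack.mean(axis=0)
    # Euclidean distance of each glyph image from the provisional mean.
    dists = np.sqrt(((stack - mean) ** 2).sum(axis=(1, 2)))
    # Keep glyphs within z_thresh standard deviations of the average distance.
    keep = np.abs(dists - dists.mean()) <= z_thresh * (dists.std() + 1e-9)
    return stack[keep].mean(axis=0), keep

# Tiny demo: nine noisy copies of a diagonal "glyph" plus one glyph
# completely filled with ink.
rng = np.random.default_rng(0)
clean = np.clip(np.eye(8) + rng.normal(0.0, 0.05, (9, 8, 8)), 0.0, 1.0)
outlier = np.ones((1, 8, 8))
proto, kept = glyph_prototype(np.concatenate([clean, outlier]))
print(kept)  # the filled glyph should be rejected
```

    In practice the glyphs would first have to be aligned (registered) to a common baseline and size before averaging; that step is omitted here.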

    It's also a problem how often a character appears. In most Western languages, the 5-6 most frequent letters (e, r, n, s, t, i) make up 50% of the text. Some letters are very rare and may be missing from the whole book.

    Title pages are special. In old books they used 5-7 different fonts, differently flourished, in different sizes, narrow and extra narrow. You will not have all glyphs in a line with 20 characters (two capitals). So you need many title pages (100+) with the same typeface to collect all capitals, and will still miss the X or Q. But the large sizes used on title pages will be of better quality, more usable for automatic vectorisation. The recognition rates of OCR on title pages are extremely poor. So for me, collecting glyphs from title pages during OCR is essential, because I can get near 100% accuracy with my method.

    Storing the cutouts of the characters along with some metadata (book, publisher, year, city) doesn't need much storage. And maybe I can compile a font automatically (I'm still learning the guts of font files) which can be a base for fontdrawers to improve them. Or, if Gábor provides a clean interface (file level?), provide everything needed as input for Glyph Collector to minimise the boring manual work.
  • @Helmut Wollmersdorfer

    Thanks for comment, it's an interesting read!

    Could you please elaborate on what you mean by:

    And maybe I can compile a font automatically (I'm still learning the guts of font files) which can be a base for fontdrawers to improve them. Or, if Gábor provides a clean interface (file level?), provide everything needed as input for Glyph Collector to minimise the boring manual work.
    What goes in the font files? Is it the vectorized prototypes (averages)?
    What could I provide a clean interface for?

    As an aside, my current stance on OCR is this:
    I thought about using OCR in GlyphCollector but decided to keep the manual method instead, for a few reasons: it's significantly simpler than OCR, it's effective, and it's language-agnostic. GlyphCollector doesn't know about characters; it just provides a convenient user interface for running template matching on images. As Frank has mentioned before, providing samples is not that much work compared to what GC automates away. I would like to keep GlyphCollector as simple and focused as possible.
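    As a side note for readers curious what "template matching on images" means in practice, here is a minimal normalized cross-correlation sketch in Python. This is a hypothetical illustration (my own function names and threshold), not GlyphCollector's actual implementation:

```python
import numpy as np

def match_template(page, template, threshold=0.9):
    """Slide `template` over `page` and return the (row, col) positions
    where the normalized cross-correlation exceeds `threshold`."""
    th, tw = template.shape
    t = template - template.mean()
    tnorm = np.sqrt((t ** 2).sum())
    hits = []
    for r in range(page.shape[0] - th + 1):
        for c in range(page.shape[1] - tw + 1):
            win = page[r:r + th, c:c + tw]
            w = win - win.mean()
            wnorm = np.sqrt((w ** 2).sum())
            if wnorm == 0 or tnorm == 0:
                continue  # flat window (blank paper): no correlation defined
            score = (w * t).sum() / (wnorm * tnorm)
            if score >= threshold:
                hits.append((r, c))
    return hits

# Tiny demo: find a 2x2 "glyph" stamped twice on a blank page.
page = np.zeros((10, 10))
glyph = np.array([[1.0, 0.0], [0.0, 1.0]])
page[1:3, 1:3] = glyph
page[6:8, 4:6] = glyph
print(match_template(page, glyph))  # [(1, 1), (6, 4)]
```

    Real tools compute this with FFT-based correlation rather than nested loops, but the score is the same idea: 1.0 for a perfect match, lower for partial overlaps.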
  • @Gábor Kerekes
    Vasil Stanev criticises the manual sampling in his comment: https://typedrawers.com/discussion/comment/43747/#Comment_43747. Also, OCR systems have smart image preprocessing that repairs degraded images (dewarping, despeckling, binarisation).

    For me the sampling is a by-product of OCR post-processing, as I collect the glyphs and compare them to improve character recognition and font recognition. The core does the same: compare images. The result is a collection of all glyph images along with the page images.

    Sampling is not much work if a clean type specimen listing all glyphs, or sample pages in an identified font, are available. If there are only books, it takes many pages to find e.g. a capital letter X or Y. The probability of the letter X (or Y) in German is 0.00001, or 1:100,000. That means somewhere on the order of 50 pages with 2000 characters on each page to find an X; maybe it needs all 500 pages.
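    The arithmetic behind that estimate, as a quick sanity check (round figures taken from the paragraph above):

```python
import math

# Round figures from the paragraph above:
p = 1e-5               # probability of the letter X per character in German
chars_per_page = 2000  # characters per page

expected_pages = 1 / (p * chars_per_page)        # pages per expected X
p_none_in_50 = (1 - p) ** (50 * chars_per_page)  # chance of no X in 50 pages

print(expected_pages)          # 50.0
print(round(p_none_in_50, 2))  # ~0.37: even after 50 pages, a real chance of no X
```

    So 50 pages is only the expected wait; since the misses are roughly Poisson-distributed, about a third of 50-page samples will still contain no X at all.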

