Need to merge duplicate glyphs within a font into single unified glyphs

LVargas · May 2020

I am working on a scalable bitmap-like (a.k.a. pixelated) outline TTF font that sources its glyphs from a HEX plain-text file (like the one Unifoundry’s Unifont project uses), which in turn was sourced from a machine-generated BDF font. Both HEX and BDF are strictly one glyph per character, so neither can map a single glyph (like “-”) to two or more codepoints that should share it (like U+002D HYPHEN-MINUS, U+00AD SOFT HYPHEN, U+2010 HYPHEN, & U+2011 NON-BREAKING HYPHEN). As a result, my SFD project always ends up with an unnecessary number of exact-duplicate glyphs.

I want to trim down the number of stored glyphs within my font project to an acceptable minimum of unique glyphs (allowing multiple encoding slots for certain glyphs like “-” from the example above) and thus reduce the final font size without decreasing Unicode coverage, but I don’t know if there is some automated, Perl-scriptable way for FontForge to detect all exact glyph duplicates within a font and merge/unify them all into single unique glyphs encoded to multiple characters. (I do not have the patience to manually check one-by-one all cases of glyph duplication in my font.)

Any help here would be greatly appreciated. Thankees!

Viktor Rubenko · May 2020

For TrueType fonts, you can use composite glyphs. The easiest way is to try to determine if some glyphs have the same outlines, and then leave one of them and replace the rest with components from this glyph. The same goes for glyphs with accents.
It can be easily done with Python, but with Perl... Idk

LVargas · May 2020

Viktor Rubenko said:

For TrueType fonts, you can use composite glyphs. The easiest way is to try to determine if some glyphs have the same outlines, and then leave one of them and replace the rest with components from this glyph. The same goes for glyphs with accents.

Uh, I am not seeking to do composite glyphs or simplify accented glyphs. What I am seeking is detecting exact-duplicate glyphs like the four hyphen examples I gave above (which share the same glyph without adding new components to it): replace four hyphen glyphs with just one glyph (and map the same glyph to four distinct codepoints), and repeat the same for other groups of characters sharing glyphs (with no additional components).

Viktor Rubenko said:

It can be easily done with Python, but with Perl... Idk

Well, the only Python I know is what I learned when dealing with VapourSynth (a video frame-editing-&-serving framework often used with VirtualDub2), but if you show me some example Python code for your glyph-compositing case, I could see whether something there applies to my case, and if I see a way to rewrite it in Perl, all the better. (It wouldn’t be the first time I translated Python code to Perl: I once did that when rewriting code for mapping a non-Unicode BDF font to Unicode before applying the Unifont scripts to convert it to HEX and then to outlined TTF.)

Thankee! (hopefully)

Theunis de Jong · May 2020

A fairly straightforward solution in Python; you can convert it to Perl, or just run this in Python and store the result elsewhere to process further.

I downloaded this unifont.hex from GitHub but you can use the one you have, if it's in the same format ("Unicode value:hex string"). The result is a list of Unicode codepoints which have equal hex strings.

with open('unifont.hex') as f:
	data = f.readlines()

# 1. make a dictionary of hexstring:[unicodes]
hexdict = {}
for line in data:
	line = line.strip()
	if not line:
		continue  # skip blank lines
	ucode, hstr = line.split(':', 1)
	hexdict.setdefault(hstr, []).append(ucode)

# 2. keep only hex strings shared by more than one codepoint
hexdict = {key: codes for key, codes in hexdict.items() if len(codes) > 1}

# 3. list the grouped codepoints, one group per line
for codes in hexdict.values():
	print(' '.join(codes))

... and the first lines of the result look like this:

As expected, the entry starting with the hyphen contains a few more characters:

002D 00AD 2012 2013 2212

Here you see the hyphen-minus, the soft (discretionary) hyphen, the figure dash, the en dash, and the minus sign.
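
A possible next step, sketched under the assumption that the input uses the same "codepoint:hexstring" format: keep the lowest codepoint of each group as the glyph to retain, and record the remaining codepoints as extra encodings for it. The function name `build_alias_map` and the lowest-codepoint rule are illustrative choices only, not something Unifont or FontForge prescribes.

```python
from collections import defaultdict

def build_alias_map(lines):
    """Map each kept codepoint to the codepoints whose bitmaps duplicate it.

    `lines` is an iterable of "CODEPOINT:HEXSTRING" strings (assumed format).
    The lowest codepoint in each duplicate group is kept (an arbitrary rule).
    """
    groups = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        ucode, hstr = line.split(':', 1)
        # Compare hex strings case-insensitively.
        groups[hstr.upper()].append(int(ucode, 16))

    aliases = {}
    for codes in groups.values():
        if len(codes) > 1:
            codes.sort()
            aliases[codes[0]] = codes[1:]
    return aliases

if __name__ == '__main__':
    # Made-up sample data, not real Unifont bitmaps.
    sample = ['002D:00007E00', '00AD:00007E00', '0041:183C66']
    for kept, extra in build_alias_map(sample).items():
        print('U+%04X <- %s' % (kept, ' '.join('U+%04X' % c for c in extra)))
```

The resulting dictionary could then drive whatever tool does the actual merging, so that only the kept glyph is stored and the other codepoints are encoded to it.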

LVargas · May 2020

Thankees. I will give it a try and possibly inform Paul Hardy of Unifoundry himself about this, since this would be greatly helpful to his project.

Helmut Wollmersdorfer · May 2020

In Perl you also have these options:

1. use the CPAN module Font::TTF

It supports only the TTF file format. It can read, manipulate and write the tables of a font. The time needed to get into the guts is high.

2. use the command-line utility ttx, which comes with fontTools (not with FontForge). It dumps the font to an XML file (.ttx); manipulate that file with your favorite XML module and convert it back to TTF with ttx.

I have a similar problem in repairing amateurish historical fonts. Duplicate glyphs, glyphs with a wrong code point, code points in the PUA.
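
To make the ttx route a bit more concrete, here is a rough Python sketch of the XML-editing step (the same idea should carry over to a Perl XML module). It assumes the usual TTX layout, where the `cmap` table holds subtables of `<map code="0x.." name="..."/>` elements, and it assumes a hypothetical `aliases` dictionary that maps each kept codepoint to its duplicate codepoints. It only repoints the character map; the now-unreferenced glyphs would still have to be removed from the other tables that list them in a separate step.

```python
import xml.etree.ElementTree as ET

def remap_cmap(ttx_xml, aliases):
    """Point duplicate codepoints at the glyph of the codepoint that is kept.

    `aliases` maps a kept codepoint to a list of duplicate codepoints
    (a hypothetical structure). Returns the modified XML as a string.
    """
    root = ET.fromstring(ttx_xml)
    cmap = root.find('cmap')
    if cmap is None:
        raise ValueError('no cmap table found in the TTX data')
    for subtable in cmap:
        # Index this subtable's <map> elements by integer codepoint.
        maps = {int(m.get('code'), 16): m for m in subtable.findall('map')}
        for kept, dupes in aliases.items():
            if kept not in maps:
                continue
            for code in dupes:
                if code in maps:
                    maps[code].set('name', maps[kept].get('name'))
    return ET.tostring(root, encoding='unicode')

if __name__ == '__main__':
    # Tiny hand-written fragment in TTX style, for illustration only.
    sample = (
        '<ttFont><cmap><tableVersion version="0"/>'
        '<cmap_format_4 platformID="3" platEncID="1" language="0">'
        '<map code="0x2d" name="hyphen"/>'
        '<map code="0xad" name="uni00AD"/>'
        '</cmap_format_4></cmap></ttFont>'
    )
    print(remap_cmap(sample, {0x2D: [0xAD]}))
```

With fontTools installed, `ttx font.ttf` writes the XML dump and `ttx font.ttx` compiles it back into a font.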
