TLDR: I basically have autokerning working. I consider the concept proven. It needs a little bit more polishing, but I don't have time right now. Will come back to it later.
Ever since watching the video of the panel discussion at Typolabs, I have been intrigued by the idea of using a neural network to kern fonts. As Just made clear, as the number of expected glyphs in a font grows, the number of kerning pairs grows quadratically, and even with kern classes this isn't a job that human beings should still be doing. At the very least, it's something we should be trying to automate.
I've had various philosophical discussions about whether we should be solving spacing, kerning, or both, but the way I approach this kind of task is to automate the jobs we're already doing. While of course we integrate spacing and kerning into the design, the usual rule is to space first and kern later, so I wanted to attack that part of the job. We have autospacers; can we next have an autokerner? (Of course we can: Igino Marini has one. But I think we should all get to have one.)
So for the past few months I have been reading all the books about machine learning and deep neural networks, and fiddling around with various ideas to make autokerning happen.
A neural network solution has two components: a trainer, which learns from available data, and a predictor, which gives you output from data it hasn't seen before. My original plan was to semi-automate the process: that is, you kern half the font the way you want it kerned, and it learns from that and kerns the other half for you. But neural networks learn best on lots and lots of data, so in order to improve my results I ended up throwing all the fonts I could find (that I considered well-designed and well-kerned) at it. Now I have something which does a reasonable job of kerning a font it has not seen before.
So in this case, the training component takes a directory full of fonts together with, for each font, a list of kern pairs generated by kerndump. It then extracts, for each glyph that it considers "safe", a set of 100 samples of the left and right sidebearings, taken from the origin up to the ascender. By "samples of the sidebearings", this is what I mean (if you've studied how the HT Autospacer works, you'll be familiar with the concept):
Let's call them the "right contour" and the "left contour" because that's what they represent. The right contour of this /a is an array of numbers [576,576,576,250,142,81,81,81,81,...].
The contours are then divided by the width in units of the font's /m, so that different fonts can be compared on the same basis. These contours are extracted for all the safe glyphs; a "safe" glyph is one whose kerning I can be reasonably sure the designer will have checked against other safe glyphs: upper- and lower-case basic Latin letters, numbers, comma, colon, and period.
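To make the sampling idea concrete, here is a toy sketch of the extraction step. This is not the actual atokern code: I'm assuming each glyph has already been reduced to a function giving its horizontal ink extents at a given height, and the sample count, ascender value, and glyph data are all made up for illustration.

```python
# Toy sketch of sidebearing sampling (not the real atokern extraction).
# `extents` maps a height y to (xmin, xmax) of the ink, or None if there
# is no ink at that height.

def sample_contours(extents, advance_width, samples=100, ascender=800):
    left, right = [], []
    for i in range(samples):
        y = ascender * i / (samples - 1)
        ink = extents(y)
        if ink is None:
            # no ink at this height: the "sidebearing" is the full advance
            left.append(advance_width)
            right.append(advance_width)
        else:
            xmin, xmax = ink
            left.append(xmin)                   # left sidebearing sample
            right.append(advance_width - xmax)  # right sidebearing sample
    return left, right

# A made-up glyph: a rectangle of ink from x=100..300, y=0..500,
# in a glyph with a 400-unit advance width.
def rect(y):
    return (100, 300) if 0 <= y <= 500 else None

left, right = sample_contours(rect, advance_width=400)

# Normalize by the /m advance width so different fonts are comparable:
m_width = 600
left_norm = [x / m_width for x in left]
```

Above the rectangle's top (y > 500) the samples fall back to the full advance width, which is one reasonable convention for "no ink at this height"; the real extractor may handle gaps differently.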
Now we feed the neural network a bunch of sets of contours. For each kern pair XY, we feed in the right contour of X and the left contour of Y. We also give as input to the network the right contour of /n, the left contour of /o, the right contour of /H and the left contour of /O. This is for two reasons: first, because you can't "correctly kern" a pair in isolation without looking at the surrounding context of a word, and second, so that the network learns how the different contours found in different fonts and different styles affect the kerning process.
Each of these six contours is passed to a one-dimensional convolutional layer, which essentially extracts the "shapeness" of it. Two-dimensional convolutional layers are used in image recognition to extract edges and other salient features from a noisy photograph; a one-dimensional layer does the same sort of thing with a one-dimensional object, so I'm using it to work out what the contour "means" as a shape.
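A tiny NumPy example shows what a 1-D convolution "sees" in a contour. The kernel here is a hand-picked difference filter, purely for illustration; the network learns many such filters itself rather than being given this one.

```python
import numpy as np

# A 1-D convolution slides a small kernel along the contour and responds
# strongly wherever the contour changes sharply. A difference kernel
# [1, -1] fires at the "edges" of the shape and stays at zero along
# flat runs.

contour = np.array([576, 576, 576, 250, 142, 81, 81, 81], dtype=float)
kernel = np.array([1.0, -1.0])

response = np.convolve(contour, kernel, mode="valid")
# response[i] = contour[i+1] - contour[i]: zero where the contour is
# flat, strongly negative where the shape suddenly cuts inward.
```

On the /a-like contour from earlier, the big negative spike lands exactly where the bowl cuts in, which is the kind of feature the trained filters pick up on.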
These six convolutional layers are then thrown into one big pot and boiled down through various-sized layers to generate an output. (It's not very easy to describe how neural networks actually work. They're basically a lot of numbers that get multiplied and added together, but it's pretty much impossible to say what each set of numbers contributes to the process.)
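For readers who want the "pot" made explicit, here is a very loose Keras sketch of an architecture like the one described: six contour inputs, one convolutional branch each, concatenated and boiled down through dense layers to a softmax over 26 buckets. The filter counts, layer sizes and input names are placeholders I have invented, not atokern's real hyperparameters.

```python
from tensorflow.keras import layers, Model

# Loose sketch only: sizes and names are illustrative placeholders.

def contour_branch(name):
    # one 100-sample contour in, one convolved feature vector out
    inp = layers.Input(shape=(100, 1), name=name)
    x = layers.Conv1D(filters=16, kernel_size=5, activation="relu")(inp)
    x = layers.Flatten()(x)
    return inp, x

names = ["right_of_X", "left_of_Y",   # the pair being kerned
         "right_of_n", "left_of_o",   # lowercase context
         "right_of_H", "left_of_O"]   # uppercase context
inputs, branches = zip(*[contour_branch(n) for n in names])

# Throw the six branches into one big pot...
x = layers.concatenate(list(branches))
# ...and boil down through various-sized layers
x = layers.Dense(128, activation="relu")(x)
x = layers.Dense(64, activation="relu")(x)
out = layers.Dense(26, activation="softmax")(x)  # 26 kern buckets

model = Model(inputs=list(inputs), outputs=out)
```

The softmax output is what turns this into the classification problem described next, rather than a raw regression on the kern value.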
I initially planned to have the system output a kern value, making this a regression problem. However, regression problems are really hard for neural networks to do. A regression output could be any float value from minus infinity to plus infinity, whereas real kern values don't work like that. NNs are much better at classification problems, so next I looked at a three-way classification: given this pair (and the background context /n/o/H/O), does it need to be kerned tighter, looser, or left alone?
Of course it's quite useful to know that a pair needs kerning, but what we really want to know is by how much. So once I had got the three-way classification working to a good degree of accuracy (about 99%), I then got the network to classify a pair into one of 26 "kern buckets". As before, everything is scaled to the /m width to allow comparison between different fonts, so one bucket might be, for example, "negative kern of somewhere between 0.1875 and 0.125 of the /m width" (giving you a range of, say, -150 to -100 units with a fairly condensed font). Other buckets, particularly around the zero mark, are narrower: "between -5 and 0 units", "between 5 and 10 units" and so on.
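The bucketing step can be sketched in a few lines. The bucket edges below are my own guesses at plausible values (and a pared-down set, not the full 26), but they show the key property: everything is scaled to the /m width first, and the buckets are narrow near zero and wide at the extremes.

```python
# Illustrative bucketing only: these edges are invented for the example,
# not atokern's real boundaries, and there are fewer than 26 of them.
# Values are fractions of the /m width; buckets narrow toward zero.

BUCKET_EDGES = [-0.1875, -0.125, -0.0625, -0.03, -0.01, 0.0,
                0.01, 0.03, 0.0625, 0.125, 0.1875]

def kern_bucket(kern_units, m_width):
    scaled = kern_units / m_width
    for i, edge in enumerate(BUCKET_EDGES):
        if scaled <= edge:
            return i
    return len(BUCKET_EDGES)  # anything wider than the last edge

# A kern of -120 units in a font whose /m is 800 units wide
# falls in the "between -0.1875 and -0.125 of /m" bucket's neighbour:
bucket = kern_bucket(-120, 800)
```

Because the scaling happens before bucketing, the same bucket index means the same visual amount of kerning whether the font is condensed or expanded.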
I said I had got to 99% accuracy in a three-way classification. To be honest I am no longer sure what degree of accuracy I am getting. It looks good, but it needs more work.
The reason I'm not sure is this: most pairs in most fonts aren't kerned, or have zero kern. There's a subtle but important difference between "aren't kerned" and "have zero kern". If the designer explicitly put a zero in the kern table for that pair, then great, they're happy with how it looks. But if they didn't put a zero in the kern table, there's no kern - which is effectively a kern value of zero - except that this time, we don't know whether the designer looked at the pair with no kern and decided no action was needed, or whether the designer didn't look at it at all and it might actually be ugly and we don't want to use it for learning. (Put your hands up if you kerned /t/Q recently.)
So for a while I didn't want to treat an empty kern entry as useful data, and I only fed the network explicit entries from the kerning tables. There aren't many explicit zero-valued kerns, so of course the network learned to apply kerning to nearly everything, because that is what it saw. But a font should have lots of zero-valued kern pairs, so I had to put the "silent" zeros back in. And because a font should have a lot of them, I couldn't artificially balance the size of each bucket: the network should learn that most kern pairs are zero.
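Putting the silent zeros back amounts to something like the following sketch: every pair of safe glyphs that has no entry in the font's kern table is treated as an explicit zero. The glyph names and kern values here are toy data, not taken from any real font.

```python
# Toy sketch: restore the "silent" zeros. Any safe-glyph pair missing
# from the kern table is assumed to be a deliberate zero kern.

safe_glyphs = ["A", "T", "o", "v"]
explicit_kerns = {("T", "o"): -60, ("A", "v"): -40}

training_pairs = {}
for left in safe_glyphs:
    for right in safe_glyphs:
        # explicit value if the designer set one, silent zero otherwise
        training_pairs[(left, right)] = explicit_kerns.get((left, right), 0)
```

Note the assumption baked in here, which is exactly the caveat above: a silent zero might mean "the designer looked and was happy", or it might mean "the designer never looked at /t/Q at all". Restricting the pairs to safe glyphs is what makes the assumption tolerable.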
And of course this makes it pretty hard to assess accuracy, because the network learns that if it guesses zero for everything then it gets the right answer 90% of the time, and so doesn't bother learning anything else. So I had to write my own loss function, which penalized the network very heavily every time it guessed a zero when there should have been a kern. At this point accuracy started going down, because I really wanted the network to be more interested in getting things wrong in interesting ways than getting them right in stupid ways. (This is also a good principle for life in general.)
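A loss of this flavour can be sketched in NumPy as a penalized cross-entropy: a confident "zero" guess on a pair that actually needs kerning costs much more than an ordinary miss. The zero-bucket index and the penalty weight of 10 are illustrative values I've picked, not the ones atokern uses.

```python
import numpy as np

# Sketch of a penalized cross-entropy loss. ZERO_BUCKET and ZERO_PENALTY
# are made-up illustrative values, not atokern's actual settings.

ZERO_BUCKET = 5       # index of the "no kern" bucket (toy value)
ZERO_PENALTY = 10.0   # how much worse a lazy zero guess is

def penalized_loss(y_true, y_pred, eps=1e-9):
    """y_true: one-hot (n, buckets); y_pred: softmax probabilities."""
    ce = -np.sum(y_true * np.log(y_pred + eps), axis=1)
    true_bucket = np.argmax(y_true, axis=1)
    pred_bucket = np.argmax(y_pred, axis=1)
    # heavy penalty for predicting "zero" on a pair that needs kerning
    lazy = (pred_bucket == ZERO_BUCKET) & (true_bucket != ZERO_BUCKET)
    return np.mean(np.where(lazy, ZERO_PENALTY * ce, ce))
```

With a weight like this, guessing zero everywhere stops being the cheapest strategy, which is the whole point: the network is pushed toward interesting mistakes rather than safe ones.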
Here are some of the results of running the predictor on a font that the network had not seen before, my own Coolangatta Thin.
And looking at some of the values it outputs gives me hope that the network really has understood how certain shapes fit together. For instance, it has realised that similar shapes should have similar kern values:
T d   -51 to -45   p=23%
T e   -40 to -35   p=21%
T g   -45 to -40   p=20%
T o   -40 to -35   p=21%
(That's the pair, the kern bucket range, and then the probability: how sure the network is that the pair belongs to that category.) Rounds kern against a /T, but it is pretty clear that straights don't:
T B   0   p=99%
T D   0   p=96%
T E   0   p=94%
T F   0   p=94%
Similarly, it knows that there is a lot of counterspace between the right of /T and the left of /A and that these shapes fit together, but that even though there is a lot of counterspace between /A and /A, those shapes cannot fit together. To me, this proves that it "gets" the idea of kerning. And that makes me happy.
I think I have proved that the concept is viable. To improve it, I would need to come back and train the network again on a larger variety of well-kerned fonts (currently I'm using a directory of 110 fonts; I am sure there are more), for a longer time and on a more powerful computer, to squeeze out even more accuracy. If anyone wants to poke at the code, it's available on GitHub at https://github.com/simoncozens/atokern and I would be very happy to talk you through it.
I'm open to other suggestions for how to develop this. But not yet. I have been obsessing over this for the past few weeks and my family are sick of it, so I need to take a break. I will come back to it later.