Hey,
I'm currently working on an open-source algorithm that determines, from a list of words, which ones are best for proofing/testing a font.
I mostly work with language dictionaries as input. My script sets a "score" for each word based on different conditions.
Depending on whether the word responds positively or negatively to a condition, its score increases or decreases.
For now, here are the conditions already set:
- Word length: the longer a word is, the more points it gains.
- No repeated letters: if a word doesn't contain the same letter twice, it gains points.
Because the goal is to check as many letters as possible, repeated letters are not something you want (most of the time).
- No hyphen: if a word contains a hyphen, it loses points.
Hyphens break the word rhythm.
- Letter singularity: each letter in the word earns points depending on how "singular" it is. An "a" brings more points than an "i" or an "n", for example.
Ideally, we want a word to contain letters that show the characteristics of a font.
- Diversity: if a word contains too many oblique/round/descender/ascender/short letters, it loses points.
Being able to see the height of the descenders/ascenders, how obliques behave against vertical stems, and how round shapes work with straight shapes is useful.
At the end, my script returns an ordered list of the words with the highest scores.
Here is a short example with an English dictionary as input:
{'housewarming': 156, 'motherfucking': 155, 'backgrounds': 147, 'thunderclaps': 146, 'considerably': 146, 'unforgivable': 146, 'indistinguishable': 145, 'malnourished': 145, 'counterintelligence': 145, 'misunderstandings': 145, 'guardsmen': 144, 'macpherson': 143, 'motherfuckin': 143, 'buckingham' ...}
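For anyone who wants to experiment, here is a rough sketch of how the five conditions might be combined. It is not the actual script: the point values, the threshold `max_per_group`, the function names, and the 'ascender' group are all assumptions. The letter ratings are the lowercase values posted further down in this thread, the 'descender' group is the one from the same post, and the 'round' group only holds the three letters named in the "chocolate" example.

```python
# Lowercase ratings as posted later in this thread.
RATING = {'a': 4, 'b': 2, 'c': 1, 'd': 2, 'e': 2, 'f': 2, 'g': 4, 'h': 2,
          'i': 1, 'j': 2, 'k': 3, 'l': 1, 'm': 2, 'n': 1, 'o': 1, 'p': 2,
          'q': 2, 'r': 3, 's': 3, 't': 2, 'u': 1, 'v': 2, 'w': 2, 'x': 2,
          'y': 3, 'z': 2}

GROUPS = {
    'descender': ['p', 'q', 'y', 'j', 'g'],            # from the thread
    'round': ['o', 'c', 'e'],                          # from the "chocolate" example
    'ascender': ['b', 'd', 'f', 'h', 'k', 'l', 't'],   # assumption
}


def score_word(word, max_per_group=3):
    """Score one word; every weight below is an arbitrary placeholder."""
    score = len(word)                      # word length
    if len(set(word)) == len(word):        # no repeated letters
        score += 5
    if '-' in word:                        # hyphen penalty
        score -= 5
    # Letter singularity: sum the per-letter ratings (unknown characters count 0).
    score += sum(RATING.get(ch, 0) for ch in word)
    # Diversity: penalise a word that leans too much on one group.
    for members in GROUPS.values():
        if sum(1 for ch in word if ch in members) > max_per_group:
            score -= 3
    return score


def rank(words, top=10):
    """Return the highest-scoring words first."""
    return sorted(words, key=score_word, reverse=True)[:top]
```

With these placeholder weights the absolute numbers will not match the scores listed above; only the ordering logic is meant to be illustrative.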
For the moment, my script is made specifically for Latin, but I would like to integrate as many scripts as possible.
Do you have any other ideas for conditions that would improve the algorithm for Latin?
Or any additional conditions for other scripts?
(R may be the preferable cap letter, as it has vertical stem, bowl, *and* diagonal. What was your highest-scoring word that started with R?)
So far the results are the "best" words set in lowercase. It could be great to add another option to find the "best" words set in UPPERCASE.
Here are the 10 words with the highest scores starting with an "r":
{'r': ['regulations', 'republicans', 'representatives', 'rosamund', 'replaying', 'ramblings', 'relocating', 'regionals', 'regulation', 'readings']
For many languages, you will probably want to ignore the presence of accent signs on letters when identifying words in this way; i.e. your results should include words with accented letters scored the same as the base letters. Although such words may not be helpful for testing during initial type design stages when diacritic marks have not been created, they give a more complete impression of characters within the language.
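One possible way to do this, as a sketch only: decompose each word with Unicode NFD normalization and drop the combining marks before scoring, so the accented word is rated like its base letters. `strip_accents` is a hypothetical helper name, not something from the script discussed here.

```python
import unicodedata


def strip_accents(word):
    """Return the word with combining diacritic marks removed."""
    # NFD splits a precomposed letter such as "é" into "e" + combining acute.
    decomposed = unicodedata.normalize("NFD", word)
    # unicodedata.combining() is non-zero for combining marks, so those are skipped.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))
```

The original word can still be shown in the results; only the scoring would use the stripped form. Note that letters without a canonical decomposition, such as "ø" or "ł", pass through unchanged and would need their own rating.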
https://github.com/jenskutilek/WoLiBaFoNaGen
It lacks, of course, diagonals—an example of what Alastair Johnston called “prime rib” in Alphabets to Order, omitting the bothersome, nasty, spiky characters in specimens which were, after all, sales tools. For a similar reason Latin lower case (Quousque tandem etc…) was favored for body text, the spacing always being nice, before massive kerning became viable, to better fit the diagonal letters.
Of more use to type designers, “-iv” has been added: Hamburgefontsiv.
The premise being that if you get the hard stuff right first, the rest falls into place.
{'hamburgefontsiv': 178, 'hamburgefonts': 159, 'housewarming': 156, 'motherfucking': 155, 'backgrounds': 147, 'thunderclaps': 146, 'considerably': 146, 'unforgivable': 146, 'malnourished': 145, 'guardsmen': 144, 'macpherson': 143, 'motherfuckin': 143, ...
Here are the ratings for each lowercase/uppercase letter:
"a":4, "A":2
I added an option to do the opposite: when a word contains more than one letter of the same type, it gains more points. For example, with this option "chocolate" gets a better score because it contains 3 "round" letters ["o", "c", "e"].
This is useful for finding words to check the consistency of round /or/ oblique /or/ ascender /or/ descender letters.
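A minimal sketch of what such an option could look like, assuming the bonus is based on how many distinct letters of one group appear in the word. The function name, the `points` weight, and the rule that a single hit earns nothing are assumptions; the `ROUND` set only contains the three letters named in the "chocolate" example, and a real group would likely be larger.

```python
ROUND = {'o', 'c', 'e'}  # only the letters named in the "chocolate" example


def consistency_bonus(word, group, points=2):
    """Reward words that contain more than one distinct letter of a group."""
    hits = {ch for ch in word if ch in group}
    # A single letter of the group gives nothing to compare, so no bonus.
    return points * len(hits) if len(hits) > 1 else 0
```

Running the same function once per group (round, oblique, ascender, descender) would give one ranked list per kind of consistency check.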
But no diagonals, unfortunately.
This algorithm for finding the best proofing words works fine for Latin letters, but it can be greatly improved for other scripts.
For each script I need to have something like this :
"groups":{
'descender' : ['p', 'q', 'y', 'j', 'g'],
{'a': 4, 'b': 2, 'c': 1, 'd': 2, 'e': 2, 'f': 2, 'g': 4, 'h': 2, 'i': 1, 'j': 2, 'k': 3, 'l': 1, 'm': 2, 'n': 1, 'o': 1, 'p': 2, 'q': 2, 'r': 3, 's': 3, 't': 2, 'u': 1, 'v': 2, 'w': 2, 'x': 2, 'y': 3, 'z': 2}},
"groups":{
'round' : ['C', 'D', 'G', 'O', 'Q'],
{'A': 2, 'B': 3, 'C': 1, 'D': 1, 'E': 1, 'F': 1, 'G': 3, 'H': 1, 'I': 1, 'J': 1, 'K': 2, 'L': 1, 'M': 2, 'N': 2, 'O': 1, 'P': 2, 'Q': 2, 'R': 3, 'S': 2, 'T': 1, 'U': 1, 'V': 1, 'W': 2, 'X': 1, 'Y': 2, 'Z': 1}}}
The ratings are set for each letter depending on how much or how little of the font's character that letter shows.
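To make it easier to contribute data for another script, the tables could be organised roughly like this. This is only a guess at a workable layout: the top-level keys `latin_lower` and `latin_upper`, the `rating` key, and the `letter_rating` helper are assumptions, while the group members and rating values are copied from the excerpt above.

```python
SCRIPT_DATA = {
    "latin_lower": {
        "groups": {"descender": ['p', 'q', 'y', 'j', 'g']},
        "rating": {'a': 4, 'b': 2, 'c': 1, 'd': 2, 'e': 2, 'f': 2, 'g': 4,
                   'h': 2, 'i': 1, 'j': 2, 'k': 3, 'l': 1, 'm': 2, 'n': 1,
                   'o': 1, 'p': 2, 'q': 2, 'r': 3, 's': 3, 't': 2, 'u': 1,
                   'v': 2, 'w': 2, 'x': 2, 'y': 3, 'z': 2},
    },
    "latin_upper": {
        "groups": {"round": ['C', 'D', 'G', 'O', 'Q']},
        "rating": {'A': 2, 'B': 3, 'C': 1, 'D': 1, 'E': 1, 'F': 1, 'G': 3,
                   'H': 1, 'I': 1, 'J': 1, 'K': 2, 'L': 1, 'M': 2, 'N': 2,
                   'O': 1, 'P': 2, 'Q': 2, 'R': 3, 'S': 2, 'T': 1, 'U': 1,
                   'V': 1, 'W': 2, 'X': 1, 'Y': 2, 'Z': 1},
    },
}


def letter_rating(ch, data=SCRIPT_DATA):
    """Look up a letter's rating in whichever table contains it, else 0."""
    for entry in data.values():
        if ch in entry["rating"]:
            return entry["rating"][ch]
    return 0
```

Adding Cyrillic or Greek would then mean adding one more entry with its own groups and ratings, without touching the scoring code.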
If anyone is interested in helping me with other scripts, I'd love to hear from you.
I have an idea for a new condition in your list: "Kerning"
Some letter pairs have a natural tendency to need kerning while others do not (for example: AV, LV, LT, PJ, Av, Vo, vo, ke, xc, etc.).
It can be useful to purposely avoid these letter pairs in the early development stages (when we have not kerned the font yet), since they can mislead our spacing/rhythm decisions.
In early stages we may want the occurrence of these pairs to lower the score.
On the other hand, in later stages, when we work on kerning, we may want the opposite: we may want to avoid words with no kerning pairs and find words with many kerned pairs, so we may want these pairs to increase the score.
So we may want to choose how kerning is scored from these 3 options:
1) Ignore kerning pairs (as it is now)
2) No kerning pairs (decrease score, to avoid rhythm contamination)
3) Many kerning pairs (increase score, to check for kerning consistency)
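As a rough sketch, the three options could be expressed as a mode switch like the one below. The names `KerningMode` and `kerning_adjustment` and the `points` weight are assumptions, and the pair set only holds the example pairs mentioned above; a real list would be much longer.

```python
from enum import Enum


class KerningMode(Enum):
    IGNORE = 1   # option 1: kerning pairs do not affect the score
    AVOID = 2    # option 2: each pair lowers the score
    PREFER = 3   # option 3: each pair raises the score


# Only the example pairs from this post, not a complete list.
KERNING_PAIRS = {"AV", "LV", "LT", "PJ", "Av", "Vo", "vo", "ke", "xc"}


def kerning_adjustment(word, mode, points=2):
    """Return the score change for a word under the chosen kerning mode."""
    if mode is KerningMode.IGNORE:
        return 0
    # Count adjacent letter pairs that appear in the kerning list.
    hits = sum(1 for a, b in zip(word, word[1:]) if a + b in KERNING_PAIRS)
    return -points * hits if mode is KerningMode.AVOID else points * hits
```

The returned value would simply be added to the word's existing score, so the other conditions stay untouched.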
The problem is that there is no "standard" list of kerning pairs for all typeface designs, since the pairs vary from typeface to typeface. To solve that, I made a Python script (1) that summarized the most common pairs across 1000 great fonts, and I compiled the results in my Font Testing Page a few years ago. You can see the resulting pairs here (they are shown in the context of control letters, like HH or OO, but I hope you can easily extract the pairs to create a list you can use):
http://www.cyreal.org/Font-Testing-Page/index-latin-02.php (navigate to "Minimal Kerning Pairs" tab)
(1) https://github.com/impallari/Impallari-Fontlab-Macros/blob/master/IMP Kerning/21 Analize Kerning all fonts.py
By the way, if you need a Spanish dictionary, mine is here for you to grab:
https://github.com/impallari/Font-Testing-Page/tree/master/includes/dictionaries
Also, if you are curious, I have another "proof of concept" idea for a tool that shows kerning in an easy and intuitive way, for both kerned and unkerned pairs, in the context of long words here: https://github.com/impallari/Contextual-Kerning-Tool
Feel free to make it happen as a glyphs plugin if you like it.
Again, many many congrats for your awesome algorithm. I love it!!!
I think I will use the Relevant Kerning list made by Andre Fuchs.
I will also add an extra filter in my Context Manager Plugin, to filter words with potential kerning.
I will soon make a repo with my algorithm.