Check language support tools

Adam Ladd · August 2018

I'm aware of a few tools that check the language support in a font (OTMaster 370 Light, which I prefer because it also gives me a number; the Underware validator; even the Font Book info view), but they all seem to give slightly different results for how many languages are actually supported/complete. I don't know that an exact hard figure can be had, given the varying factors, but I'm wondering whether others have found any of these tools to be most accurate. (I've also already looked at this thread: https://typedrawers.com/discussion/comment/30880#Comment_30880)

Thanks.

André G. Isaak · August 2018

I can’t comment on the relative accuracy of these tools, but I think it’s inevitable that you’re going to find discrepancies. It may be relatively easy to identify which letters and diacritics are absolutely essential to a particular language, but that’s not quite the same as which letters are needed to fully support a particular language. As a trivial example, English doesn’t require any diacritics, but it would be naïve to think that there are no English words containing them. Similarly, other languages will often make frequent use of diacritics and/or letters found in neighbouring languages and there’s really no way to objectively decide on how common such characters must be before they should be considered necessary to properly support a particular language.

Thomas Phinney · August 2018

Also, the count of supported languages is going to vary depending on which languages the tool knows about. It’s not as if there is some definitive list of how many languages use the Latin alphabet, or Cyrillic, for instance.

How many living speakers/writers does a language need to be included? What about dead languages (including Latin itself!), do you include or exclude them based on scholarly importance? What about languages that were formerly written with one writing system but are now written with another? And that’s not even considering any arguments about what constitutes a language in the first place.
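
A minimal Python sketch of this point, not taken from any of the tools mentioned: the same font yields a different "supported languages" count depending on which language list it is checked against. The two databases, their character sets, and the function name `supported_languages` are illustrative assumptions, not authoritative orthography data.

```python
# Hypothetical, deliberately tiny language databases (illustrative data only).
SMALL_DB = {
    "English": set("abcdefghijklmnopqrstuvwxyz"),
    "Danish": set("abcdefghijklmnopqrstuvwxyzæøå"),
}
# A second tool that simply "knows about" more languages.
LARGE_DB = {
    **SMALL_DB,
    "Latin": set("abcdefghiklmnopqrstuvxyz"),
    "Skolt Sami": set("abcdefghijklmnopqrstuvzâčʒǯđǧǥǩŋõšžåä"),
}


def supported_languages(font_chars, database):
    """Return the sorted names of languages whose required set is covered."""
    return sorted(
        name for name, required in database.items() if required <= font_chars
    )


# Assumed repertoire of one font (lowercase only, to keep the sketch short).
font_chars = set("abcdefghijklmnopqrstuvwxyzæøå")

print(supported_languages(font_chars, SMALL_DB))
print(supported_languages(font_chars, LARGE_DB))
```

The font has not changed between the two calls; only the list of candidate languages has, and the count changes with it.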

notdef · August 2018

“Danish not supported. Missing glyphs: Aringacute, aringacute”

John Savard · August 2018

Frode said:

“Danish not supported. Missing glyphs: Aringacute, aringacute”

Never having heard of these glyphs, I did a web search, and found out that Unicode had such a glyph, a Danish A with a circle over it, then with an acute accent over the combination.

Further searching led me to discover that it might be used in Danish dictionaries where syllable stress is marked.

In Russian, syllable stress is marked by acute accents over vowels. And in English, dictionaries will show the pronunciation of words using special symbols like the schwa. I wonder if language support tools are equally pedantic for both of those languages.

Johannes Neumeier · August 2018

I personally like the Alphabet Type tools for their comprehensiveness and their "two way workflow", i.e. either select languages and see what glyphs they need, or upload a file and check it against the languages you wish to support. You can also pick not only languages, but Unicode blocks and even charsets (more of a historic angle, but it, too, shows what certain standardization bodies and large corporations once thought essential to a language). And it gives you lists of required, auxiliary and punctuation glyphs per language, which is very nice for gauging how important a glyph might be.

In line with @Frode’s comment, there are some odd glyphs for certain languages here, too, so you always need to use your own judgement.

On the broader topic of language support, it is also interesting to open the above-mentioned tool and check what is listed as auxiliary for a language you yourself speak. At least for me, those include characters I've never even encountered in those languages: regional minority languages, support for loanwords I've never seen written with those characters, historic glyphs...

Adam Ladd · August 2018

The Alphabet Type tools are nice and useful. I've looked at them before but not too deeply. This prompted me to look again. Thanks.

Dave Crossland · November 2024

Shaperglot is now web based!

https://googlefonts.github.io/shaperglot

I think this is impressive Rust/WASM work from @Simon Cozens, as this language checking tool is better than anything else out there in terms of raw language data, because it includes OpenType shaping checks. But the UX needs love.

John Hudson · November 2024

Also new since this thread was last live: Hyperglot.

mitradranirban · November 2024

The underlying language data on which Shaperglot or similar tools work needs to be checked too. The last time I checked, NotoSansDevanagari failed the Shaperglot test for Hindi because the tool was placing standalone vowel marks (which must show a dotted circle, as per the OpenType recommendation) and then taking the dotted circle as a fail.
I submitted a bug report, which is yet to be closed.

Simon Cozens · November 2024

mitradranirban said:

The underlying language data on which Shaperglot or similar tools work needs to be checked too. The last time I checked, NotoSansDevanagari failed the Shaperglot test for Hindi because the tool was placing standalone vowel marks (which must show a dotted circle, as per the OpenType recommendation) and then taking the dotted circle as a fail.
I submitted a bug report, which is yet to be closed.

This is fixed in the Rust/WASM implementation:

Image: https://us.v-cdn.net/5019405/uploads/editor/hl/t5gurp0zm7b9.png

Kent Lew · November 2024

Are these really *necessary* to fully support the Finnish language?: Ǥ, Ʒ, Ǯ, ǥ, ʒ, ǯ

Image: https://us.v-cdn.net/5019405/uploads/editor/uh/5cxixo5hdj8l.png

What do you mean by “fully”? Seems like making “full support” dependent upon such auxiliary characters could be misleading to the average user or inexperienced type designer.

Simon Cozens · November 2024

Kent Lew said:

Are these really *necessary* to fully support the Finnish language?: Ǥ, Ʒ, Ǯ, ǥ, ʒ, ǯ

Short answer: yes.

What do you mean by “fully”? Seems like making “full support” dependent upon such auxiliary characters could be misleading to the average user or inexperienced type designer.

I'm starting from the perspective that language support is not binary but granular.

For example, a font which has a smcp feature may have all the codepoints necessary to write text in Yoruba, but its smcp feature may not provide small caps versions of ẹ and ḿ. Does it support Yoruba? Sure. Does it fully support Yoruba? Not as well as it supports other languages, so no. Is this better or worse Yoruba support than a font which has all the codepoints but which has no smcp feature at all?

Auxiliary characters are similar. é is not a letter used to write in English but "café" is a word which does get written in English texts. Because of this, it's not wrong to say that a font which contains an é is better for English users than one which does not. But it's equally not wrong to say that a font without an é supports the English language.

"Fully supports", "supports" and "does not support" (together with a percentage score) is a way to capture this kind of "it could be better" distinction. I believe this, together with the rationales and explanations you can find under the disclosure triangles, is less misleading than a simple binary yes/no.

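The three-way distinction above can be sketched in a few lines of Python. This is only an illustration of the idea, not Shaperglot's actual scoring, which also runs shaping checks; the function name `classify`, the percentage formula, and the one-character auxiliary set are assumptions made for the example.

```python
def classify(font_chars, base, auxiliary):
    """Return (label, percent) for one language.

    Assumed rule: the label depends on whether the base and auxiliary
    sets are covered; percent is the share of base + auxiliary
    characters present in the font.
    """
    wanted = base | auxiliary
    percent = round(100 * len(wanted & font_chars) / len(wanted))
    if not base <= font_chars:
        return "does not support", percent
    if not auxiliary <= font_chars:
        return "supports", percent
    return "fully supports", percent


# Illustrative data for English: a-z as base, é as the only auxiliary.
BASE = set("abcdefghijklmnopqrstuvwxyz")
AUXILIARY = {"é"}

print(classify(BASE, BASE, AUXILIARY))          # font without é
print(classify(BASE | {"é"}, BASE, AUXILIARY))  # font with é
print(classify(BASE - {"z"}, BASE, AUXILIARY))  # font missing a base letter
```

A font without é still lands in the middle tier here, which matches the "café" example: supported, but it could be better.
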
Kent Lew · November 2024

Fair enough. I agree that support is a granular concept. In this case, I suppose my initial reaction was the difference between a casual or lay concept of “fully” and a more specific, contextual use.

Word borrowings from neighboring or immigrant languages being absorbed over time create a lot of edge cases, which get slippery.

Also, it feels weird to say that many common, versatile & popular fonts don’t fully support the German language (because they don’t include the historical ſ), for example.

John Hudson · November 2024

Simon, I think the granular approach runs into a classification problem at the border of language support and text support. I also think the generalised bucket of ‘auxiliary codepoints’ is problematic, because there are several quite distinct categories of extra-alphabetic characters that can occur in specific kinds of text. Lumping them all together and saying that they constitute ‘full’ support for a language is problematic.

Then there is the question of sources. Take those ostensible auxiliary characters for Finnish, which do not occur in any of the repertoires of the Finnish alphabet that I checked this morning. Where do they come from? As far as I can tell, the Google Fonts Language Database does not contain any source information. Looking at the entry for Finnish, it seems that the list of auxiliary characters includes basically all diacritics used in any other European language.

auxiliary: "Á À Ă Â Ã Ą Ā Ć Č Ċ Ç Ď Ð Đ É È Ê Ě Ë Ė Ę Ē Ğ Ǧ Ģ Ǥ Ȟ Ħ Í Î Ï İ Į Ī I Ǩ Ķ Ĺ Ľ Ļ Ł Ń Ň Ñ Ņ Ŋ Ó Ò Ô Ő Õ Œ Ŕ Ř Ś Ŝ Ş Ș ẞ Ť Ţ Ț Ŧ Ú Ù Û Ů Ű Ų Ū Ý Ÿ Ü Ź Ż Ʒ Ǯ Þ Æ Ø á à ă â ã ą ā ć č ċ ç ď ð đ é è ê ě ë ė ę ē ğ ǧ ģ ǥ ȟ ħ í î ï į ī ı ǩ ķ ĺ ľ ļ ł ń ň ñ ņ ŋ ó ò ô ő õ œ ŕ ř ś ŝ ş ș ß ť ţ ț ŧ ú ù û ů ű ų ū ý ÿ ü ź ż ʒ ǯ þ æ ø"

That doesn’t seem very useful as a determinative for ‘full’ support of the Finnish language. Am I right in guessing that the source for this data is a corpora trawl of online text tagged as Finnish?

Jens Kutilek · November 2024

John Hudson said:

That doesn’t seem very useful as a determinative for ‘full’ support of the Finnish language. Am I right in guessing that the source for this data is a corpora trawl of online text tagged as Finnish?

It seems those are coming from the Unicode CLDR.

Simon Cozens · November 2024

Indeed, anything which doesn't have a "source" entry is from CLDR.

John Hudson · November 2024

Further to the point about the difference between supporting a language and supporting a text, Shaperglot tells me that the Tiro Devanagari Hindi font fully supports Sanskrit, and that is true for a very large number of Sanskrit texts, but is not true for Vedic texts, for which the Tiro Devanagari Sanskrit font is needed.

Conversely, the new Tiro Sinhala and Tiro Malayalam fonts I tested are each listed as supporting only the eponymous language of their respective scripts, but I can assure you that they support Sanskrit at least as well as Tiro Devanagari Hindi, as well as Pāli.

Lars Törnqvist · November 2024

The characters Ǥ, Ʒ, Ǯ, ǥ, ʒ, ǯ are not used in the Finnish language. They belong to Skolt Sami, which is spoken in Finland and has local official status in the municipality of Inari.

Denis Moyogo Jacquerye · November 2024

The Unicode CLDR auxiliary exemplar character data is defined as "Additional characters for common foreign words, technical usage" and has the following description:

The main set should contain the minimal set required for users of the language, while the auxiliary exemplar set is designed to encompass additional characters: those non-native or historical characters that would customarily occur in common publications, dictionaries, and so on. Major style guidelines are good references for the auxiliary set. So, for example, if Irish newspapers and magazines would commonly have Danish names using å, for example, then it would be appropriate to include å in the auxiliary exemplar characters; just not in the main exemplar set. Thus English has the following:

<exemplarCharacters>[a b c d e f g h i j k l m n o p q r s t u v w x y z]</exemplarCharacters>
<exemplarCharacters type="auxiliary">[á à ă â å ä ã ā æ ç é è ĕ ê ë ē í ì ĭ î ï ī ñ ó ò ŏ ô ö ø ō œ ú ù ŭ û ü ū ÿ]</exemplarCharacters>

For a given language, there are a few factors that help for determining whether a character belongs in the auxiliary set, instead of the main set:

The character is not available on all normal keyboards.
It is acceptable to always use spellings that avoid that character.

For example, the exemplar character set for en (English) is the set [a-z]. This set does not contain the accented letters that are sometimes seen in words like "résumé" or "naïve", because it is acceptable in common practice to spell those words without the accents. The exemplar character set for fr (French), on the other hand, must contain those characters: [a-z é è ù ç à â ê î ô û æ œ ë ï ÿ]. The main set typically includes those letters commonly "alphabet".

SIL’s SLDR, Rosetta’s Hyperglot and Simon’s Shaperglot roughly or exactly follow this Unicode CLDR definition of auxiliary for their own auxiliary sets. Interestingly, they do not have the exact same auxiliary set for Finnish.

One can claim to support a language without supporting the auxiliary set. But, if that auxiliary set is tailored properly for the target use, then it should also be supported. The issue here is that it’s unclear which auxiliary set has been tailored for font language support or even what that means. A font for Finnish language text that uses and preserves the Skolt Sami spelling of people’s names does not have the same requirements as a font for Finnish language text that doesn’t.
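
To make the main/auxiliary split concrete, here is a rough Python sketch that reads exemplar strings like the ones quoted above and checks a font repertoire against them. The parser `parse_exemplars` is a hypothetical helper that only handles space-separated characters and simple `a-z` ranges; the real CLDR syntax (UnicodeSet) is richer, with `{…}` sequences and escapes, and is not covered. The font repertoire is invented for the example.

```python
import unicodedata


def parse_exemplars(text):
    """Parse a simple exemplar string such as "[a-c é]" into a set.

    Only single characters and X-Y ranges are understood; any other
    token is stored as-is (an assumption, not CLDR behaviour).
    """
    text = unicodedata.normalize("NFC", text)  # compare precomposed forms
    chars = set()
    for token in text.strip().strip("[]").split():
        if len(token) == 3 and token[1] == "-":
            start, end = ord(token[0]), ord(token[2])
            chars.update(chr(cp) for cp in range(start, end + 1))
        else:
            chars.add(token)
    return chars


# English exemplar data as quoted in the post above.
MAIN = parse_exemplars("[a b c d e f g h i j k l m n o p q r s t u v w x y z]")
AUXILIARY = parse_exemplars(
    "[á à ă â å ä ã ā æ ç é è ĕ ê ë ē í ì ĭ î ï ī ñ ó ò ŏ ô ö ø ō œ ú ù ŭ û ü ū ÿ]"
)

# Hypothetical font repertoire: basic Latin lowercase plus é only.
font_chars = parse_exemplars("[a-z é]")

print("main covered:", MAIN <= font_chars)
print("auxiliary missing:", " ".join(sorted(AUXILIARY - font_chars)))
```

Whether the missing auxiliary characters matter is exactly the judgement call discussed above; the code can only report them.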

Kent Lew · November 2024

Lars Törnqvist said:

The characters Ǥ, Ʒ, Ǯ, ǥ, ʒ, ǯ are not used in the Finnish language. They belong to Skolt Sami,

Exactly. There may be occasions when borrowed Skolt Sami words are used in the context of otherwise Finnish text, particularly in proximity to the Inari region, but that doesn’t make those words part of Finnish.

On the other hand, I would argue that at this point a borrowing like “café” has been fully absorbed into the English language.

All of which is less a criticism of the Shaperglot tool, per se, and more an example of the slippery-ness of the task that seems to be set out. (And an argument against drawing the “fully” line where it’s currently set.)

I do otherwise like the tool, Simon — let it not go unsaid. 😊

Kent Lew · November 2024

As an aside, what would be a cool feature is if, in those cases where the dropped font “fully” supports the language, the sample text was actually displayed in that font.

In those cases where not, the entire display would default back to your font stack. But I imagine that kind of conditionality would be too complicated to implement.

Johannes Neumeier · November 2024

This once again underlines the difference between language and geopolitical entities. "Finnish", the language with a given orthography, which may have imported characters from other languages — versus "Finnish", an official language of Finland, which implicitly may need to support minority languages within the context of the geopolitical entity, albeit of different linguistic origin.

When the loanword is de facto part of the language, then an orthography (and a language checking tool, by extension) needs to account for it. In my opinion you cannot simply include loanword characters based on the eventuality that a name, place or concept may be using them; most Latin-script languages would share the same orthography then!

Case in point, Ǥ, Ʒ, Ǯ, ǥ, ʒ, ǯ are not orthographically part of Finnish, but š and ž are, because there are words imported into Finnish which cannot be spelled (and voiced correctly) without them; for some words the official recommendation allows both (shakki and šakki, from the Germanic "Schach" for chess), whereas others only know the one spelling (maharadža). (Here, if you care to use a translator.)

Simon Cozens · November 2024

I've made the language around the summary reports more positive ("not supported", "supported", "comprehensively supported").
Accordingly I've changed the yellow pill (for what was previously "not fully supported") to light green, and the "comprehensively supported" (what was "supported") yellow pill to dark green.
I've implemented Kent's suggestion about displaying the sample text in the test font for supported languages.

Thanks for the feedback so far; more suggestions are welcome.

Denis Moyogo Jacquerye · November 2024

@Johannes Neumeier The argument for Ǥ, Ʒ, Ǯ, ǥ, ʒ, ǯ in names doesn’t actually seem to hold when looking at corpora: they pretty much do not occur, as the norm is to use translated forms. Besides, ʹ is missing for Skolt Sami names anyway. So they should be removed, as suggested.

I’m wondering why Estonian õ is in your Finnish auxiliary set but é isn’t. é is 10 times more common in corpora and does appear in the Finnish last names data set (Swedish spelling forms are used); both appear in the Finnish first names data set.
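
For anyone who wants to run this kind of comparison on their own corpus, a minimal Python sketch follows. The helper name `per_million` and the sample string are made up for illustration; they are not the corpora or data sets referred to above, and the numbers mean nothing beyond showing the mechanics.

```python
from collections import Counter


def per_million(corpus, chars):
    """Occurrences of each character per million characters of corpus text."""
    counts = Counter(corpus.casefold())  # fold case so É and é count together
    total = sum(counts.values())
    if total == 0:
        return {ch: 0.0 for ch in chars}
    return {ch: 1_000_000 * counts[ch] / total for ch in chars}


# Invented sample text; a real check needs a large, representative corpus.
sample = "André Lindén ja Põld kävivät Café Ekbergissä."

freq = per_million(sample, ["é", "õ"])
print(freq)
print("é more common than õ:", freq["é"] > freq["õ"])
```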

Yves Michel · November 2024

I usually used Alphabet and/or Bulletproof, but I like Shaperglot very much.
I find it a bit "pessimistic" by beginning with the not supported languages. I'd prefer to see the supported ones first. But I may be too optimistic!

On this topic, there are explanations of the reasons for the absence (or presence) of Ǥ, Ʒ, Ǯ, ǥ, ʒ, ǯ in Finnish, for instance. The same for other languages.
It would be interesting if these explanations were provided in Shaperglot.

But a very useful tool, indeed!

Johannes Neumeier · December 2024

@Denis Moyogo Jacquerye Indeed é should be in auxiliary, thanks for pointing out that oversight!

Regarding "The argument for Ǥ, Ʒ, Ǯ, ǥ, ʒ, ǯ in names": I was agreeing that they should not, in fact, be part of Finnish ("the language") orthography.

John Savard · December 2024

Lars Törnqvist said:

The characters Ǥ, Ʒ, Ǯ, ǥ, ʒ, ǯ are not used in the Finnish language. They belong to Skolt Sami, which is spoken in Finland and has local official status in the municipality of Inari.

Thank you. I did not think that those characters were required to write the Finnish language, as I had never seen them before, even though I had encountered Finnish-language texts.

One could argue that a typeface without French accents doesn't fully support the English language, since sometimes French words are quoted in English texts. (And the recent thread about the Washington Post suggests we should go quite a bit further than that!) Skolt Sami being a minority language, though, it presumably isn't often quoted in ordinary Finnish texts.

John Savard · December 2024

Johannes Neumeier said:

whereas others only know the one spelling (maharadža).

I presume that this is the Finnish form of "Maharaja", a Sanskrit word literally meaning "Great King". I'm surprised, though, given what can be done for Chess, that maharadzha isn't possible. But then, perhaps Finnish can't use the letter "h" in quite all the ways that English does.
