Please be careful when using resources based on CLDR, which uses automated OCR to identify all the characters of a given language; this results in several problems. Tools like Alphabets and Underware's Latin Pro are very nice, but not fully reliable because of CLDR.
The DB I'm building is based on several sources. Usually, Omniglot is the most reliable, but it lacks a number of languages. Wikipedia in the language under research is usually a solid source. Letter Database is also very good. And CLDR is useful if combined with other sources.
Thank you all, that's very helpful! But all those resources are limited to letters. What about punctuation marks and other glyphs?
@Igor Freiberger, I couldn't find the answer at Omniglot, at least not directly. For example, https://omniglot.com/writing/french.htm#written doesn't mention æ at all. It only mentions Æ, and even that in a note to the Pronunciation table. Am I missing something?
Regarding French, Omniglot is right when it presents the table of 26 letters as the official alphabet. Letters with diacritics and the ligatures æ and œ aren't in the alphabet itself. This is a good example of how misleading the term "alphabet" can be, since it doesn't match what you actually need to write the language.
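The gap between the official alphabet and the working orthography is easy to make concrete with a couple of set operations. A minimal sketch; the French character lists below are illustrative (lowercase only) rather than exhaustive:

```python
# The official 26-letter French alphabet, as taught in school.
official_alphabet = set("abcdefghijklmnopqrstuvwxyz")

# Diacritics and ligatures required by French orthography but absent
# from the "alphabet" as usually listed (illustrative, lowercase only).
extra_needed = set("àâæçéèêëîïôœùûüÿ")

# What you actually need to write the language:
full_orthography = official_alphabet | extra_needed

# The gap between "alphabet" and "orthography" is exactly these characters:
print(sorted(extra_needed - official_alphabet))
```

The point is that a font covering only `official_alphabet` "supports the French alphabet" while being unable to set ordinary French text.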
Quickly checked it and discovered some characters (Ёё Ээ) listed for Ukrainian that shouldn't be there. They are actually letters from Russian (the same Cyrillic script), so I can't say it is a mistake, just an inaccuracy.
From the Ukrainian PDF available on The Alphabets of Europe (https://www.evertype.com/alphabets), Michael Everson has those letters in square brackets, which means this:
[Square brackets] around a letter indicate that, in the sources, a letter is a) usually listed in the alphabet but is only ever used to represent foreign names and words; b) never or rarely listed in the alphabet (in schoolbooks, for instance) but used to represent foreign names and words; or c) never listed as part of the alphabet but often used to represent foreign names and words.
So it is not inaccurate to list it, just an option.
@George Thomas, Michael Rafailyk is Ukrainian, so I believe we can take his critique as correct. 😀
I always view the "letters used only to write foreign names" with a lot of suspicion. The whole Latin script fits in this category. Or the whole of Cyrillic. Or all of Unicode.
If a German newspaper publishes an article about Pelé, it needs the é. Does this make it an auxiliary character for German? An English book about Dvořák needs ř and á. Are they part of an auxiliary English alphabet? Where to draw the line?
I prefer to call an orthography the set of characters needed to write a language, including numbers and punctuation, but not considering foreign names.
Typotheque (founded by a Slovak) also lists Ő Ű among Slovak letters, about which Wikipedia says "The letter is still used for this purpose in Slovak phonetic transcription systems", so I guess one should be careful when using it as a source, since it may have some extras on top of what a regular font is expected to have.
It's inconceivable to me that ISO doesn't define standard character sets by language and locale.
And it's inconceivable to me how ISO could possibly do that—define standard character sets for languages. ISO 12199 attempted to document characters for several languages, but it's not exactly a success. CLDR is not a standard, but it is more successful, in part because it is not an ISO standard.
Please be careful when using resources based on CLDR, which uses automated OCR to identify all the characters of a given language
Wherever did you get that idea? All of the data in CLDR gets vetted by human reviewers.
I am not so creative as to invent such a strange idea. I read about it on the CLDR site, probably around the time of version 31. For the auxiliary characters, documents were scanned to identify what was often used to write a language beyond its basic alphabet. I remember quite well that there was even a list of scanned documents by language. For Portuguese, the documents included travel and technical guides, which may explain some of the errors still present in CLDR today.
Since CLDR embraces a huge amount of data, maybe the human reviewers are focused on other items. Or maybe this was changed in more recent versions (I saw that voting was used in version 39). Anyway, the auxiliary characters are still totally unreliable. Please see this:
For Portuguese, the list of auxiliary characters is ª ă å ä ā æ è ĕ ë ē ì ĭ î ï ī ñ º ŏ ö ø ō œ ù ŭ û ü ū ÿ
I have a page saved with this list from CLDR 31. Now, in CLDR 39, it is still the same.
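For reference, CLDR publishes these lists per locale as `exemplarCharacters` elements in XML (e.g. in common/main/pt.xml). A minimal sketch of how one might extract them, using an abbreviated snippet that mimics the structure of the real file (the element contents here are shortened/illustrative, not a faithful copy of any CLDR release):

```python
import re
import xml.etree.ElementTree as ET

# Abbreviated snippet mimicking the structure of CLDR's common/main/pt.xml.
cldr_pt = """<ldml>
  <characters>
    <exemplarCharacters>[a á à â ã b c ç d e é ê f g h i í j k l m n o ó ô õ p q r s t u ú v w x y z]</exemplarCharacters>
    <exemplarCharacters type="auxiliary">[ª ă å ä ā æ è ĕ ë ē ì ĭ î ï ī ñ º ŏ ö ø ō œ ù ŭ û ü ū ÿ]</exemplarCharacters>
  </characters>
</ldml>"""

root = ET.fromstring(cldr_pt)
sets = {}
for el in root.iter("exemplarCharacters"):
    kind = el.get("type", "main")  # the untyped element is the main set
    # Strip the surrounding [ ] and split on whitespace.
    sets[kind] = set(re.sub(r"[\[\]]", "", el.text).split())

print(sorted(sets["auxiliary"]))
```

Real CLDR exemplar data also uses escapes and ranges inside the brackets, so a production parser needs more than this whitespace split.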
Let's start with ĭ: a character used for the romanization of Mongolian and Khmer. The probability of frequent use of such a character is zero, not only in Portuguese but in most languages.
Similarly, ă å ä ĕ ë î ï ŏ ö ø ŭ û are never used anywhere in Portuguese. Even in academic papers, these would be quite rare. I believe most of these characters are actually the regular ones from the base orthography (á ã é í ó õ and ú) badly recognized from a scan.
The characters ā ē ī ō ū are also never used, but they could become "frequent" if you take a book about the Latin language, where the macron is used to mark long vowels. Again, a clear clue of a scanning-based procedure.
From the list, ª and º are truly frequent, while ñ is somewhat frequent due to cultural proximity with Spanish-speaking countries. è ò ù were used in the old orthography, and ö ü appear in German words, occasionally present due to German immigration in the 1800s. And some cultural proximity between France and Portugal could explain the ÿ. So we have 9 correct entries out of 29 characters.
I can see similar problems in other languages. Spanish, for example: ª à ă â å ä ã ā æ ç è ĕ ê ë ē ì ĭ î ï ī º ò ŏ ô ö ø ō œ ù ŭ û ū ý ÿ
From these, ă å ä ā ĕ ë ĭ î ï ī ò ŏ ō ù ŭ û ū are extremely rare or never used in Spanish, so we have a clear pattern of bad OCR here. By the way, ĭ seems to be a favorite of CLDR, since it appears in every language I checked.
With so many errors, I suppose no human reviewed the lists I checked (Portuguese, Spanish, Galician, Quechua, French, Italian, and Catalan).
I’m sorry, but the version of Noto Sans on this forum doesn’t come with combining mark support. We should know better than to grab the stripped-down webfont that Google feeds to the naive masses. Shame!
@Ori Ben-Dor You can check your font’s language coverage (limited to LGC) using Bulletproof (disclaimer: I’m the author). The data (including the character sets) was cherry-picked by my humble self, checked against various sources (mostly Wikipedia) in a rather undocumented way, and collected in an npm package, language-data.
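A coverage check of this kind boils down to comparing the code points a font's cmap covers against a per-language character set. A minimal sketch with a mock cmap; the Polish subset is illustrative, not an authoritative definition. With fontTools, the real code points would come from something like `set(TTFont(path).getBestCmap())`:

```python
# Illustrative (not authoritative) character set for Polish.
polish_chars = set("aąbcćdeęfghijklłmnńoópqrsśtuvwxyzźż")

def missing_chars(cmap_codepoints, required):
    """Return the required characters absent from the font's cmap."""
    return {ch for ch in required if ord(ch) not in cmap_codepoints}

# Mock cmap: a font covering plain ASCII lowercase letters only.
ascii_only = {ord(c) for c in "abcdefghijklmnopqrstuvwxyz"}

# The font fails Polish on exactly the diacritic letters.
print(sorted(missing_chars(ascii_only, polish_chars)))
```

The hard part, of course, is not this comparison but deciding what goes into `polish_chars` in the first place, which is what the whole thread is about.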
… With so many errors, I suppose no human reviewed the lists I checked …
With so many errors, a source becomes completely unreliable and quite irrelevant.
It is nice to look at some of the references mentioned (and there are a few good ones), but for my own character sets I rely on my own research only. As simple as that.
Yes, very simple, but also time-consuming, and it requires some experience. If everybody is doing the same thing, some of us are wasting time, no? I'd pay good money for a reliable, comprehensive, and well-documented source on language support.
@Igor Freiberger, I gave French just as an example; it illustrates the difficulty of extracting the information from Omniglot. @Adam Jagosz, thanks for suggesting and creating Bulletproof! Thanks, everyone!
As it happens, on my web site (in response to a news item about the French government looking for a better computer keyboard that supports capital accented letters), I gave an illustration of a potential answer to their quest, based on my research into which glyphs the French language requires...
And: the matter is complex. For every language/orthography/country you have several potential tiers of requirements: a) basic recent; b) historic basic; c) historic advanced; d) scientific; e) &cª &cª. There are things like Pinyin or Arabic transliteration or traditional Irish … one needs to look at it and think it over from case to case: what is needed and what is not.
But a concise guide about it would be a nice thing, I agree.
One complication for any would-be source for language support is deciding on appropriate targets depending on the purpose of the character set definition.
Even simplifying Andreas’ concerns, there is often a substantial difference between:
1) the minimum set of characters (and possibly glyph variants) that should be present to declare that a given font supports language X
2) all the characters (and potentially glyph variants as well) one should put in a font that one wants to support language X
It is of course possible to have a database that distinguishes between the two, or even more levels, and even have additional info about what the uses are of those “extra” characters. But gosh, that is a lot of work. And the whole thing has a great deal of subjectivity to it.
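A sketch of how such a multi-level database entry might look; the German sets and the tier names here are hypothetical, chosen only to illustrate the "declare support" vs. "full support" distinction, not taken from any real database:

```python
# Hypothetical two-tier model for language support: a font may meet the
# "basic" tier (enough to claim support) without the "full" one (everything
# a well-equipped font should carry). Sets are illustrative, not authoritative.
tiers = {
    "deu": {
        "basic": set("abcdefghijklmnopqrstuvwxyzäöüß"),
        "full":  set("abcdefghijklmnopqrstuvwxyzäöüß„“‚‘»«–"),
    },
}

def support_level(font_chars, lang):
    """Classify a font's coverage of a language as none/basic/full."""
    if tiers[lang]["full"] <= font_chars:
        return "full"
    if tiers[lang]["basic"] <= font_chars:
        return "basic"
    return "none"

# A font with letters and umlauts but no German quotes or dashes:
ascii_plus_umlauts = set("abcdefghijklmnopqrstuvwxyzäöüß")
print(support_level(ascii_plus_umlauts, "deu"))
```

Adding the "additional info about what the uses are" mentioned above would mean annotating each character in the `full` tier, which is exactly where the workload explodes.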
Not saying it is not worth doing; I tried documenting those two ends of it in my character set work for Extensis back in the day.
Open source would probably leave out underserved languages. An open source database would probably consist of data people add if they have time during a project. Since the underserved languages aren’t getting used in many projects nobody would be getting paid to add to the database.
I would think suggesting any needed corrections to Underware’s or Typotheque's or Rosetta's efforts would be more efficient than trying to build something new from scratch.
… Open source would probably leave out underserved languages. … Since the underserved languages aren’t getting used in many projects nobody would be getting paid to add to the database. …
The question is what the whole thing is all about — a practical database for type designers or a database of all languages on the planet? Rarely used languages are not worth adding in the first place (by worth I mean the difference between costs and benefits). So I wouldn’t expect a private research project to cover them either. But for someone local, the costs are lower and benefits are higher than for Europeans / Americans doing research.
So I think it’s kind of reversed: open source lets locals add their languages, even if it makes no sense for Europeans and Americans.
There is also ScriptSource, https://scriptsource.org/, which offers a collaborative platform for documenting glyph usage in various writing systems, scripts, languages, language variants, etc. I think the navigation of the website exemplifies the complexity of such a platform. In practice, I tend to find little to no information for most things I am interested in, so I’m unsure a new platform would perform any better.
For European languages, Michael Everson’s Alphabets of Europe provides information on which forms of quotation marks are used.
Other sources:
1. Further information on French, including the punctuation (at the end of the article).
Some things are definitely subjective, but as long as the debate is visible it is still super helpful.
There are also LOCL features, yes? Anything else?
Try looking at Microsoft’s spec for the Universal Shaping Engine, which along the way defines what it expects from fonts. https://docs.microsoft.com/en-us/typography/script-development/use
It is relatively simple for Latin, Greek and Cyrillic. But once you get into “complex scripts” it is … well, complex.
To be honest, this is the most intimidating way to look at this info. But that said, this stuff can get pretty darn complicated.