Please be careful when using resources based on CLDR, which uses automated OCR to identify all the characters of a given language; this results in several problems. Tools like Alphabets and Underware's Latin Pro are very nice, but not fully reliable because of CLDR.
The DB I'm building is based on several sources. Usually, Omniglot is the most reliable, but it lacks a number of languages. Wikipedia in the language under research is usually a solid source. Letter Database is also very good. And CLDR is useful if combined with other sources.
Thank you all, that's very helpful! But all those resources are limited to letters. What about punctuation marks and other glyphs?
@Igor Freiberger, I couldn't find the answer at Omniglot, at least not directly. For example, https://omniglot.com/writing/french.htm#written doesn't mention æ at all. It only mentions Æ, and even that in a note to the Pronunciation table. Am I missing something?
Regarding French, Omniglot is right when it presents the table of 26 letters as the official alphabet. Letters with diacritics and the ligatures æ and œ aren't in the alphabet itself. This is a good example of how misleading the term "alphabet" can be, since it doesn't match what you actually need to write the language.
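The gap between the official alphabet and the working orthography is easy to make concrete with a couple of set operations. A minimal sketch; the French character lists below are illustrative (lowercase only) rather than exhaustive:

```python
# The official 26-letter French alphabet, as taught in school.
official_alphabet = set("abcdefghijklmnopqrstuvwxyz")

# Diacritics and ligatures required by French orthography but absent
# from the "alphabet" as usually listed (illustrative, lowercase only).
extra_needed = set("àâæçéèêëîïôœùûüÿ")

# What you actually need to write the language:
full_orthography = official_alphabet | extra_needed

# The gap between "alphabet" and "orthography" is exactly these characters:
print(sorted(extra_needed - official_alphabet))
```

The point is that a font covering only `official_alphabet` "supports the French alphabet" while being unable to set ordinary French text.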
Quickly checked it and discovered some characters (Ёё Ээ) listed for Ukrainian that shouldn't be there. They are actually letters from Russian (the same Cyrillic script), so I can't say it is a mistake, just an inaccuracy.
From the Ukrainian PDF available on The Alphabets of Europe (https://www.evertype.com/alphabets), Michael Everson has those letters in square brackets, which means this:
[Square brackets] around a letter indicate that, in the sources, a letter is a) usually listed in the alphabet but is only ever used to represent foreign names and words; b) never or rarely listed in the alphabet (in schoolbooks, for instance) but used to represent foreign names and words; or c) never listed as part of the alphabet but often used to represent foreign names and words.
So it is not inaccurate to list it, just an option.
@George Thomas, Michael Rafailyk is Ukrainian, so I believe we can take his critique as correct. 😀
I always view the "letters used only to write foreign names" with a lot of suspicion. The whole Latin script fits in this category. Or the whole of Cyrillic. Or all of Unicode.
If a German newspaper publishes an article about Pelé, it needs the é. Does this make it an auxiliary character for German? An English book about Dvořák needs ř and á. Are they part of an auxiliary English alphabet? Where to draw the line?
I prefer to call an orthography the set of characters needed to write a language, including numbers and punctuation, but not considering foreign names.
Typotheque (founded by a Slovak) also lists Ő Ű among Slovak letters, about which Wikipedia says "The letter is still used for this purpose in Slovak phonetic transcription systems", so I guess one should be careful when using it as a source, since it may have some extras on top of what a regular font is expected to have.
It's inconceivable to me that ISO doesn't define standard character sets by language and locale.
And it's inconceivable to me how ISO could possibly do that—define standard character sets for languages. ISO 12199 attempted to document characters for several languages, but it's not exactly a success. CLDR is not a standard, but it is more successful, in part because it is not an ISO standard.
Please be careful when using resources based on CLDR, which uses automated OCR to identify all the characters of a given language
Wherever did you get that idea? All of the data in CLDR gets vetted by human reviewers.
I am not so creative as to invent such a strange idea. I read about it on the CLDR site, probably around the time of version 31. For the auxiliary characters, documents were scanned to identify what was often used to write a language beyond its basic alphabet. I remember quite well that there was even a list of scanned documents by language. For Portuguese, the documents included travel and technical guides, which may explain some of the errors still present in CLDR today.
Since CLDR embraces a huge amount of data, maybe the human reviewers are focused on other items. Or maybe this was changed in more recent versions (I saw that voting was used in version 39). Anyway, the auxiliary characters are still totally unreliable. Please see this:
For Portuguese, the list of auxiliary characters is ª ă å ä ā æ è ĕ ë ē ì ĭ î ï ī ñ º ŏ ö ø ō œ ù ŭ û ü ū ÿ
I have a page saved with this list from CLDR 31. Now, in CLDR 39, it is still the same.
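For reference, CLDR publishes these lists per locale as `exemplarCharacters` elements in XML (e.g. in common/main/pt.xml). A minimal sketch of how one might extract them, using an abbreviated snippet that mimics the structure of the real file (the element contents here are shortened/illustrative, not a faithful copy of any CLDR release):

```python
import re
import xml.etree.ElementTree as ET

# Abbreviated snippet mimicking the structure of CLDR's common/main/pt.xml.
cldr_pt = """<ldml>
  <characters>
    <exemplarCharacters>[a á à â ã b c ç d e é ê f g h i í j k l m n o ó ô õ p q r s t u ú v w x y z]</exemplarCharacters>
    <exemplarCharacters type="auxiliary">[ª ă å ä ā æ è ĕ ë ē ì ĭ î ï ī ñ º ŏ ö ø ō œ ù ŭ û ü ū ÿ]</exemplarCharacters>
  </characters>
</ldml>"""

root = ET.fromstring(cldr_pt)
sets = {}
for el in root.iter("exemplarCharacters"):
    kind = el.get("type", "main")  # the untyped element is the main set
    # Strip the surrounding [ ] and split on whitespace.
    sets[kind] = set(re.sub(r"[\[\]]", "", el.text).split())

print(sorted(sets["auxiliary"]))
```

Real CLDR exemplar data also uses escapes and ranges inside the brackets, so a production parser needs more than this whitespace split.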
Let's start with ĭ: a character used for the romanization of Mongolian and Khmer. The probability of frequent use of such a character is zero, not only in Portuguese but in most languages.
Similarly, ă å ä ĕ ë î ï ŏ ö ø ŭ û are never used anywhere in Portuguese. Even in academic papers, these would be quite rare. I believe most of these characters are actually the regular ones from the base orthography (á ã é í ó õ and ú) badly recognized from a scan.
The characters ā ē ī ō ū are also never used, but they could become "frequent" if you take a book about the Latin language, where the macron is used to mark long vowels. Again, a clear clue of a scanning-based procedure.
From the list, ª and º are truly frequent, while ñ is somewhat frequent due to cultural proximity with Spanish-speaking countries. è ò ù were used in the old orthography, and ö ü appear in German words, occasionally present due to German immigration in the 1800s. And some cultural proximity between France and Portugal could explain the ÿ. So we have 9 correct entries out of 29 characters.
I can see similar problems in other languages. Spanish, for example: ª à ă â å ä ã ā æ ç è ĕ ê ë ē ì ĭ î ï ī º ò ŏ ô ö ø ō œ ù ŭ û ū ý ÿ
From these, ă å ä ā ĕ ë ĭ î ï ī ò ŏ ō ù ŭ û ū are extremely rare or never used in Spanish, so we have a clear pattern of bad OCR here. By the way, ĭ seems to be a favorite of CLDR, since it appears in every language I checked.
With so many errors, I suppose no human reviewed the lists I checked (Portuguese, Spanish, Galician, Quechua, French, Italian, and Catalan).
I’m sorry, but the version of Noto Sans on this forum doesn’t come with combining mark support. We should know better than to grab the stripped-down webfont that Google feeds to the naive masses. Shame!
@Ori Ben-Dor You can check your font’s language coverage (limited to LGC) using Bulletproof (disclaimer: I’m the author). The data (including the character sets) was cherry-picked by my humble self, checked against various sources (mostly Wikipedia) in a rather undocumented way, and collected in an npm package, language-data.
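A coverage check of this kind boils down to comparing the code points a font's cmap covers against a per-language character set. A minimal sketch with a mock cmap; the Polish subset is illustrative, not an authoritative definition. With fontTools, the real code points would come from something like `set(TTFont(path).getBestCmap())`:

```python
# Illustrative (not authoritative) character set for Polish.
polish_chars = set("aąbcćdeęfghijklłmnńoópqrsśtuvwxyzźż")

def missing_chars(cmap_codepoints, required):
    """Return the required characters absent from the font's cmap."""
    return {ch for ch in required if ord(ch) not in cmap_codepoints}

# Mock cmap: a font covering plain ASCII lowercase letters only.
ascii_only = {ord(c) for c in "abcdefghijklmnopqrstuvwxyz"}

# The font fails Polish on exactly the diacritic letters.
print(sorted(missing_chars(ascii_only, polish_chars)))
```

The hard part, of course, is not this comparison but deciding what goes into `polish_chars` in the first place, which is what the whole thread is about.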
… With so many errors, I suppose no human reviewed the lists I checked …
With so many errors, a source becomes completely unreliable and quite irrelevant.
It is nice to look at some of the references mentioned (and there are a few good ones), but for my own character sets I rely on my own research only. As simple as that.
Yes, very simple, but also time-consuming, and it requires some experience. If everybody is doing the same thing, some of us are wasting time, no? I'd pay good money for a reliable, comprehensive, and well-documented source on language support.
@Igor Freiberger, I gave French just as an example; it illustrates the difficulty of extracting the information from Omniglot. @Adam Jagosz, thanks for suggesting and creating Bulletproof! Thanks, everyone!
As it happens, on my web site (in response to a news item about the French government looking for a better computer keyboard that supports capital accented letters), I gave an illustration of a potential answer to their quest, based on my research into which glyphs the French language requires...
And: the matter is complex. For every language/orthography/country you have several potential tiers of requirements: a) basic recent; b) historic basic; c) historic advanced; d) scientific; e) &cª &cª. There are things like Pinyin or Arabic transliteration or traditional Irish … one needs to look at it and think it over from case to case: what is needed and what is not.
But a concise guide about it would be a nice thing, I agree.
One complication for any would-be source for language support is deciding on appropriate targets depending on the purpose of the character set definition.
Even simplifying Andreas’ concerns, there is often a substantial difference between:
1) the minimum set of characters (and possibly glyph variants) that should be present to declare that a given font supports language X
2) all the characters (and potentially glyph variants as well) one should put in a font that one wants to support language X
It is of course possible to have a database that distinguishes between the two, or even more levels, and even have additional info about what the uses are of those “extra” characters. But gosh, that is a lot of work. And the whole thing has a great deal of subjectivity to it.
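A sketch of how such a multi-level database entry might look; the German sets and the tier names here are hypothetical, chosen only to illustrate the "declare support" vs. "full support" distinction, not taken from any real database:

```python
# Hypothetical two-tier model for language support: a font may meet the
# "basic" tier (enough to claim support) without the "full" one (everything
# a well-equipped font should carry). Sets are illustrative, not authoritative.
tiers = {
    "deu": {
        "basic": set("abcdefghijklmnopqrstuvwxyzäöüß"),
        "full":  set("abcdefghijklmnopqrstuvwxyzäöüß„“‚‘»«–"),
    },
}

def support_level(font_chars, lang):
    """Classify a font's coverage of a language as none/basic/full."""
    if tiers[lang]["full"] <= font_chars:
        return "full"
    if tiers[lang]["basic"] <= font_chars:
        return "basic"
    return "none"

# A font with letters and umlauts but no German quotes or dashes:
ascii_plus_umlauts = set("abcdefghijklmnopqrstuvwxyzäöüß")
print(support_level(ascii_plus_umlauts, "deu"))
```

Adding the "additional info about what the uses are" mentioned above would mean annotating each character in the `full` tier, which is exactly where the workload explodes.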
Not saying it is not worth doing; I tried documenting those two ends of it in my character set work for Extensis back in the day.
Open source would probably leave out underserved languages. An open source database would probably consist of data people add if they have time during a project. Since the underserved languages aren’t getting used in many projects nobody would be getting paid to add to the database.
I would think suggesting any needed corrections to Underware’s or Typotheque's or Rosetta's efforts would be more efficient than trying to build something new from scratch.
… Open source would probably leave out underserved languages. … Since the underserved languages aren’t getting used in many projects nobody would be getting paid to add to the database. …
The question is what the whole thing is all about — a practical database for type designers or a database of all languages on the planet? Rarely used languages are not worth adding in the first place (by worth I mean the difference between costs and benefits). So I wouldn’t expect a private research project to cover them either. But for someone local, the costs are lower and benefits are higher than for Europeans / Americans doing research.
So I think it’s kind of reversed: open source lets locals add their languages, even if it makes no sense for Europeans and Americans.
There is also ScriptSource, https://scriptsource.org/, which offers a collaborative platform for documenting glyph usage in various writing systems, scripts, languages, language variants, etc. I think the navigation of the website exemplifies the complexity of such a platform. In practice, I tend to find little to no information for most things I am interested in, so I’m unsure a new platform would perform any better.
For European languages, Michael Everson’s Alphabets of Europe provides information on which forms of quotation marks are used.
Other sources:
1. Further information on French, including the punctuation (at the end of the article).
Some things are definitely subjective, but as long as the debate is visible it is still super helpful.
There are also LOCL features, yes? Anything else?
Try looking at Microsoft’s spec for the Universal Shaping Engine, which along the way defines what it expects from fonts. https://docs.microsoft.com/en-us/typography/script-development/use
It is relatively simple for Latin, Greek and Cyrillic. But once you get into “complex scripts” it is … well, complex.
To be honest, this is the most intimidating way to look at this info. But that said, this stuff can get pretty darn complicated.