Where do I find which glyphs are required for a given language?
Ori Ben-Dor
Posts: 386
Is there a database or some other resource where I can find the glyph set required for a given language?
Thanks!
Comments
-
Underware's site may be useful.
-
-
Please be careful when using resources based on CLDR, which uses automated OCR to identify all the characters of a given language, and this results in several problems. Tools like Alphabets and Underware's Latin Pro are very nice, but not fully reliable because of CLDR.
The DB I'm building is based on several sources. Usually, Omniglot is the most reliable, but it lacks a number of languages. Wikipedia in the language under research is usually a solid source. Letter Database is also very good. And CLDR is useful if combined with other sources.
-
Thank you all, that's very helpful!
But all those resources are limited to letters. What about punctuation marks and other glyphs?
@Igor Freiberger, I couldn't find the answer at Omniglot, at least not directly. For example, https://omniglot.com/writing/french.htm#written doesn't mention æ at all. It only mentions Æ, and even that in a note to the Pronunciation table. Am I missing something?
-
Rosetta’s Hyperglot is proving to be pretty good, but also does not document punctuation.
For European languages, Michael Everson’s Alphabets of Europe provides information on which forms of quotation marks are used.
-
Regarding French, Omniglot is right when it presents the table of 26 letters as the official alphabet. Letters with diacritics and the ligatures æ and œ aren't in the alphabet itself. This is a good example of how misleading the term "alphabet" can be, since it doesn't match what you actually need to write the language.
Other sources:
1. Further information on French, including the punctuation (at the end of the article).
2. CLDR > Locales: you can usually trust the data under items 4 (Others: numbers) and 5 (Others: punctuation) presented in the Native column.
-
Michael Rafailyk said: From the Ukrainian PDF available on The Alphabets of Europe https://www.evertype.com/alphabets Michael Everson has those letters in square brackets, which means this:
[Square brackets] around a letter indicate that, in the sources, a letter is a) usually listed in the alphabet but is only ever used to represent foreign names and words; b) never or rarely listed in the alphabet (in schoolbooks, for instance) but used to represent foreign names and words; or c) never listed as part of the alphabet but often used to represent foreign names and words.
So it is not inaccurate to list it, just an option.
-
It's inconceivable to me that ISO doesn't define standard character sets by language and locale.
-
@George Thomas, Michael Rafailyk is Ukrainian, so I believe we can take his critique as correct. 😀
I always view the category of "letters used only to write foreign names" with a lot of suspicion. The whole Latin script fits in this category. Or the whole Cyrillic. Or all Unicode.
If a German newspaper publishes an article about Pelé, it needs the é. Does this make it an auxiliary character for German? An English book about Dvořák needs ř and á. Are they part of an auxiliary English alphabet? Where to draw the line?
I prefer to call an orthography the set of characters needed to write a language, including numbers and punctuation, but not considering foreign names.
-
Typotheque (founded by a Slovak) also lists ŐŰ among Slovak letters, about which Wikipedia says “The letter is still used for this purpose in Slovak phonetic transcription systems”. So I guess you should be careful when using it as a source, since it may have some extras on top of what a regular font is expected to have.
-
Igor Freiberger said: Please be careful when using resources based on CLDR, which uses automated OCR to identify all the characters of a given language
-
Peter Constable said: Igor Freiberger said: Please be careful when using resources based on CLDR, which uses automated OCR to identify all the characters of a given language
Since CLDR embraces a huge amount of data, maybe the human reviewers are focused on other items. Or maybe this was changed in more recent versions (I saw that voting was used in version 39). Anyway, auxiliary characters are still totally unreliable. Please see this:
For Portuguese, the list of auxiliary characters is
ª ă å ä ā æ è ĕ ë ē ì ĭ î ï ī ñ º ŏ ö ø ō œ ù ŭ û ü ū ÿ
I have a page saved with this list from CLDR 31. Now, in CLDR 39, it is still the same.
Let's start with ĭ: a character used for romanization of Mongolian and Khmer. The probability of frequent use of such a character is zero, not only in Portuguese but in most languages.
Similarly, ă å ä ĕ ë î ï ŏ ö ø ŭ û are never used anywhere in Portuguese. Even in academic papers, these would be quite rare. I believe most of these characters are actually the regular ones from the base orthography (á ã é í ó õ and ú), badly recognized from a scan.
The characters ā ē ī ō ū are also never used, but they could become "frequent" if you take a book about the Latin language, where the macron is used to mark long vowels. Again, a clear clue of a scanning procedure.
From the list, ª º are truly frequent, while ñ is somewhat frequent due to cultural proximity with Spanish-speaking countries. è ò ù were used in the old orthography, and ö ü are used in German words, occasionally present due to German immigration in the 1800s. And some cultural proximity between France and Portugal could explain the ÿ. So we have 9 correct entries out of 29 characters.
I can see similar problems in other languages. Spanish, for example:
ª à ă â å ä ã ā æ ç è ĕ ê ë ē ì ĭ î ï ī º ò ŏ ô ö ø ō œ ù ŭ û ū ý ÿ
From these, ă å ä ā ĕ ë ĭ î ï ī ò ŏ ō ù ŭ û ū are extremely rare or never used in Spanish, so we have a clear pattern of bad OCR here. By the way, ĭ seems to be a favorite of CLDR since it appears in all the languages I checked.
With so many errors, I suppose no human reviewed the lists I checked (Portuguese, Spanish, Galician, Quechua, French, Italian, and Catalan).
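The comparison above can be repeated mechanically. The following is only a rough sketch in Python: it takes the two auxiliary lists and the "doubtful" Spanish list exactly as quoted in this post (it does not read anything from CLDR itself) and reports which doubtful characters appear in both languages. The names `PT_AUX`, `ES_AUX`, `ES_DOUBTFUL`, `parse` and `describe` are made up for illustration.

```python
import unicodedata

# Lists copied from the post above; they are the poster's transcription,
# not data fetched from CLDR.
PT_AUX = "ª ă å ä ā æ è ĕ ë ē ì ĭ î ï ī ñ º ŏ ö ø ō œ ù ŭ û ü ū ÿ"
ES_AUX = "ª à ă â å ä ã ā æ ç è ĕ ê ë ē ì ĭ î ï ī º ò ŏ ô ö ø ō œ ù ŭ û ū ý ÿ"
# Characters the post describes as extremely rare or unused in Spanish.
ES_DOUBTFUL = "ă å ä ā ĕ ë ĭ î ï ī ò ŏ ō ù ŭ û ū"


def parse(chars: str) -> set[str]:
    """Split a space-separated list into a set of NFC-normalized characters."""
    return {unicodedata.normalize("NFC", c) for c in chars.split()}


def describe(chars: set[str]) -> list[str]:
    """Return 'U+XXXX NAME' labels so look-alike characters can be told apart."""
    labels = []
    for c in sorted(chars):
        codes = " ".join(f"U+{ord(cp):04X}" for cp in c)
        labels.append(f"{c} {codes} {unicodedata.name(c[0], '?')}")
    return labels


# Characters listed as auxiliary for both languages.
shared = parse(PT_AUX) & parse(ES_AUX)
# Of those, the ones flagged as doubtful for Spanish.
shared_doubtful = shared & parse(ES_DOUBTFUL)

for label in describe(shared_doubtful):
    print(label)
```

Printing code points next to the characters helps here because breve, caron and macron forms are easy to confuse at small sizes.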
-
I’m sorry, but the version of Noto Sans on this forum doesn’t come with combining mark support. We should know better than to grab the stripped-down webfont that Google feeds to the naive masses. Shame!
@Ori Ben-Dor You can check your font’s language coverage (limited to LGC) using Bulletproof (disclaimer: I’m the author). The data (including the character sets) was cherry-picked by my humble self, checked against various sources (but mostly Wikipedia) in a rather undocumented way, and collected in an npm package, language-data.
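For readers who want to see what such a coverage check boils down to, here is a minimal sketch in Python. It is not how Bulletproof or language-data work; it only compares a required character string against the code points a font maps. `missing_characters` and `font_codepoints` are hypothetical helper names, and `font_codepoints` assumes the third-party fontTools package is installed.

```python
import unicodedata


def missing_characters(required: str, supported: set[int]) -> list[str]:
    """List required characters (NFC-normalized) whose code point is not supported."""
    missing = []
    for ch in sorted(set(unicodedata.normalize("NFC", required))):
        if ch.isspace():
            continue  # whitespace is not part of the character set being checked
        if ord(ch) not in supported:
            missing.append(ch)
    return missing


def font_codepoints(path: str) -> set[int]:
    """Read the code points a font maps in its preferred Unicode cmap subtable."""
    from fontTools.ttLib import TTFont  # imported lazily; third-party dependency

    font = TTFont(path)
    try:
        return set(font.getBestCmap() or {})
    finally:
        font.close()


# Usage with a made-up supported set instead of a real font file:
# printable ASCII plus ó, checked against the Polish-specific lowercase letters.
pretend_font = set(range(0x20, 0x7F)) | {ord("ó")}
print(missing_characters("ąćęłńóśźż", pretend_font))
```

A check like this only covers the cmap. It says nothing about combining marks, mark positioning or localized forms, which come up later in the thread.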
-
Igor Freiberger said: …
With so many errors, I suppose no human reviewed the lists I checked …
With so many errors, a source becomes completely unreliable and quite irrelevant.
It is nice to look at some of the references mentioned (and there are a few good ones), right, but for my own character sets I rely on my own research only. As simple as that.
-
@Igor Freiberger, I gave French just as an example; it illustrates the difficulty of extracting the information from Omniglot.
@Adam Jagosz, thanks for suggesting and creating Bulletproof!
Thanks, everyone!
-
As it happens, on my web site (in response to a news item on the French government looking for a better computer keyboard that supported capital accented letters) I gave an illustration of a potential answer to their quest, which was based on my research on what glyphs the French language requires...
-
Jasper de Waard said: … I'd pay good money for a reliable, comprehensive, and well-documented source regarding language support.
A good point, Jasper, actually. But I hesitate to embark on this, for a couple of reasons.
It is one thing for me to compile an encoding and make some notes about it for my own work. But it is another thing if I were to consider selling it, which would require including thorough documentation of every detail. Moreover, I’d have to expect that some day that knowledge would find its way into the paste©-sphere, where everyone gets free access to it and the business case is gone. For that reason I actually believe this sort of thing ought to be administered by some sort of public standards body.
And: the matter is complex. For every language/orthography/country you have at least three potential sections of requirements: a) basic recent; b) historic basic; c) historic advanced; d) scientific; e) &cª &cª. There are things like Pinyin or Arabic transliteration or traditional Irish … one needs to look at it and think it over from case to case, what is needed and what is not.
But a concise guide about it would be a nice thing, I agree.
-
One complication for any would-be source for language support is deciding on appropriate targets depending on the purpose of the character set definition.
Even simplifying Andreas’ concerns, there is often a substantial difference between:
1) the minimum set of characters (and possibly glyph variants) that should be present to declare that a given font supports language X
2) all the characters (and potentially glyph variants as well) one should put in a font that one wants to support language X
It is of course possible to have a database that distinguishes between the two, or even more levels, and even have additional info about what the uses are of those “extra” characters. But gosh, that is a lot of work. And the whole thing has a great deal of subjectivity to it.
Not saying it is not worth doing; I tried documenting those two ends of it in my character set work for Extensis back in the day.
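To make the two-target idea concrete, here is one possible shape for a record that keeps the "minimum" and "everything you would want" sets apart. This is a sketch only, in Python; the class name `CharsetLevels`, its fields and the three level labels are invented for illustration, and the sample data is a toy, not a claim about any real language.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CharsetLevels:
    """Character requirements for one language, split into two targets."""

    language: str
    minimum: frozenset[str]                 # needed to claim support at all
    extended: frozenset[str] = frozenset()  # nice to have for full support

    def support_level(self, font_chars: set[str]) -> str:
        """Classify a font's character set against the two targets."""
        if not self.minimum <= font_chars:
            return "unsupported"
        if self.extended <= font_chars:
            return "full"
        return "minimum"

    def missing(self, font_chars: set[str]) -> dict[str, list[str]]:
        """Report what is missing at each level, sorted for stable output."""
        return {
            "minimum": sorted(self.minimum - font_chars),
            "extended": sorted(self.extended - font_chars),
        }


# Toy data for a made-up language code "xx".
toy = CharsetLevels("xx", frozenset("ab"), frozenset("é"))
print(toy.support_level({"a", "b"}))       # meets the minimum only
print(toy.support_level({"a", "b", "é"}))  # meets both targets
print(toy.missing({"a"}))
```

More levels (historic, scientific, and so on) would fit the same pattern, but deciding what goes into each set is still the subjective part.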
-
And what about an open source approach? Sharing the load would help a lot, I imagine... though someone would probably still have to oversee things.
Some things are definitely subjective, but as long as the debate is visible it is still super helpful.
-
Open source would probably leave out underserved languages. An open source database would probably consist of data people add if they have time during a project. Since the underserved languages aren’t getting used in many projects nobody would be getting paid to add to the database.
-
I would think suggesting any needed corrections to Underware’s or Typotheque's or Rosetta's efforts would be more efficient than trying to build something new from scratch.
-
James Puckett said: Open source would probably leave out underserved languages. An open source database would probably consist of data people add if they have time during a project. Since the underserved languages aren’t getting used in many projects nobody would be getting paid to add to the database.
So I think it’s kind of reversed: open source lets locals add their languages, even if it makes no sense for Europeans and Americans.
-
There is also ScriptSource, https://scriptsource.org/, which offers a collaborative platform for documenting glyph usage in various writing systems, scripts, languages, language variants, etc. I think the navigation of the website exemplifies the complexity of such a platform. In practice, I tend to find little to no information for most things I am interested in, so I’m unsure a new platform would perform any better.
-
Someone should say this: providing glyphs for Unicode codepoints does not equate to providing a workable font for a given language system.
-
Of current projects, I find Rosetta’s Hyperglot the most promising.
-
Simon Cozens said: Someone should say this: providing glyphs for Unicode codepoints does not equate to providing a workable font for a given language system.
There are also locl features, yes? Anything else?
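As a small illustration of looking past the cmap, this Python sketch lists the GSUB feature tags a font declares and reports which wanted tags are absent. It assumes the object passed in behaves like a fontTools `TTFont` (table lookup by name, with `FeatureList.FeatureRecord` entries carrying a `FeatureTag`); the helper names and the default of checking only `locl` are illustrative choices, and the presence of a tag says nothing about whether the feature is correct for a given language.

```python
from types import SimpleNamespace


def gsub_feature_tags(font) -> set[str]:
    """Collect the feature tags declared in a font's GSUB table, if any."""
    if "GSUB" not in font:
        return set()
    feature_list = font["GSUB"].table.FeatureList
    if feature_list is None:
        return set()
    return {record.FeatureTag for record in feature_list.FeatureRecord}


def missing_features(font, wanted=("locl",)) -> list[str]:
    """Return the wanted feature tags that the font does not declare."""
    present = gsub_feature_tags(font)
    return [tag for tag in wanted if tag not in present]


# Stand-in object mimicking the attribute layout described above, so the
# sketch runs without a font file. With fontTools installed, a real font
# would be opened as TTFont("SomeFont.ttf") (hypothetical file name).
fake_font = {
    "GSUB": SimpleNamespace(
        table=SimpleNamespace(
            FeatureList=SimpleNamespace(
                FeatureRecord=[SimpleNamespace(FeatureTag="liga")]
            )
        )
    )
}
print(missing_features(fake_font))
```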
-
heh. Oh heck yes.
Try looking at Microsoft’s spec for the Universal Shaping Engine, which along the way defines what it expects from fonts. https://docs.microsoft.com/en-us/typography/script-development/use
It is relatively simple for Latin, Greek and Cyrillic. But once you get into “complex scripts” it is … well, complex.
To be honest, this is the most intimidating way to look at this info. But that said, this stuff can get pretty darn complicated.