Where do I find which glyphs are required for a given language?

Ori Ben-Dor Posts: 386
edited July 2021 in Resources
Is there a database or some other resource where I can find the glyph set required for a given language?
Thanks!

Comments

  • Craig Eliason Posts: 1,436
    Underware's site may be useful. 
  • Ori Ben-Dor Posts: 386
    edited July 2021
    Thank you all, that's very helpful!
    But all those resources are limited to letters. What about punctuation marks and other glyphs?

    @Igor Freiberger, I couldn't find the answer at Omniglot, at least not directly. For example, https://omniglot.com/writing/french.htm#written doesn't mention æ at all. It only mentions Æ, and even that in a note to the Pronunciation table. Am I missing something? 
  • John Hudson Posts: 3,186
    Rosetta’s Hyperglot is proving to be pretty good, but also does not document punctuation.

    For European languages, Michael Everson’s Alphabets of Europe provides information on which forms of quotation marks are used.
  • Igor Freiberger Posts: 273
    edited July 2021
    Regarding French, Omniglot is right when it presents the table of 26 letters as the official alphabet. Letters with diacritics and the ligatures æ and œ aren't part of the alphabet itself. This is a good example of how misleading the term "alphabet" can be, since it doesn't match what you actually need to write the language.

    Other sources:
    1. Further information on French, including the punctuation (at the end of the article).
    2. CLDR > Locales: you can usually trust the data under items 4 (Others: numbers) and 5 (Others: punctuation) in the Native column.
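
    If you keep a local copy of the CLDR data, those exemplar sets are easy to pull out programmatically. Below is a minimal Python sketch; it assumes a local checkout of https://github.com/unicode-org/cldr (the file path is illustrative), and it returns the raw UnicodeSet strings rather than expanding them.

    ```python
    # Minimal sketch: read CLDR exemplarCharacters (main, auxiliary,
    # punctuation, ...) from a locale file under common/main/.
    # The path is illustrative; point it at your own CLDR checkout.
    import xml.etree.ElementTree as ET

    def exemplar_sets(path):
        """Return the raw exemplar UnicodeSet strings keyed by type."""
        sets = {}
        for node in ET.parse(path).getroot().iter("exemplarCharacters"):
            sets[node.get("type", "main")] = node.text  # e.g. "[a à â æ ...]"
        return sets

    print(exemplar_sets("cldr/common/main/fr.xml").get("punctuation"))
    ```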
  • Michael Rafailyk Posts: 146
    edited July 2021
    I quickly checked it and discovered some characters (Ёё Ээ) in the Ukrainian language that shouldn’t be there. Actually, they are letters from Russian (the same Cyrillic script), so I can’t say it is a mistake, just an inaccuracy.
  • George Thomas Posts: 645
    edited July 2021
    I quickly checked it and discovered some characters (Ёё Ээ) in the Ukrainian language that shouldn’t be there. Actually, they are letters from Russian (the same Cyrillic script), so I can’t say it is a mistake, just an inaccuracy.
    In the Ukrainian PDF available from The Alphabets of Europe (https://www.evertype.com/alphabets), Michael Everson has those letters in square brackets, which means this:

    [Square brackets] around a letter indicate that, in the sources, a letter is a) usually listed in the alphabet but is only ever used to represent foreign names and words; b) never or rarely listed in the alphabet (in schoolbooks, for instance) but used to represent foreign names and words; or c) never listed as part of the alphabet but often used to represent foreign names and words.

    So it is not inaccurate to list them, just optional.

  • It's inconceivable to me that ISO doesn't define standard character sets by language and locale.
  • Alex Visi Posts: 185
    Typotheque (founded by a Slovak) also lists ŐŰ among Slovak letters, about which Wikipedia says “The letter is still used for this purpose in Slovak phonetic transcription systems”, so I guess be careful when using it as a source, since it may have some extras on top of what a regular font is expected to have.
  • Please be careful when using resources based on CLDR, which uses automated OCR to identify all the characters of a given language
    Wherever did you get that idea? All of the data in CLDR gets vetted by human reviewers.
  • It's inconceivable to me that ISO doesn't define standard character sets by language and locale.
    And it's inconceivable to me how ISO could possibly do that—define standard character sets for languages. ISO 12199 attempted to document characters for several languages, but it's not exactly a success. CLDR is not a standard, but it is more successful, in part because it is not an ISO standard.
  • Igor Freiberger Posts: 273
    edited July 2021
    Please be careful when using resources based on CLDR, which uses automated OCR to identify all the characters of a given language
    Wherever did you get that idea? All of the data in CLDR gets vetted by human reviewers.
    I am not creative enough to invent such a strange idea. I read about it on the CLDR site, probably around version 31. For the auxiliary characters, documents were scanned to identify what was often used to write a language beyond its basic alphabet. I remember quite well that there was even a list of scanned documents by language. For Portuguese, the documents included travel and technical guides, which may explain some of the errors still present in CLDR today.

    Since CLDR embraces a huge amount of data, maybe the human reviewers are focused on other items. Or maybe this changed in more recent versions (I saw that voting was used in version 39). Anyway, the auxiliary characters are still totally unreliable. Please see this:

    For Portuguese, the list of auxiliary characters is
    ª ă å ä ā æ è ĕ ë ē ì ĭ î ï ī ñ º ŏ ö ø ō œ ù ŭ û ü ū ÿ

    I have a page saved with this list from CLDR 31. Now, in CLDR 39, it is still the same.

    Let's start with ĭ: a character used for the romanization of Mongolian and Khmer. The probability of frequent use of such a character is zero, not only in Portuguese but in most languages.

    Similarly, ă å ä ĕ ë î ï ŏ ö ø ŭ û are never used anywhere in Portuguese. Even in academic papers, these would be quite rare. I believe most of these characters are actually the regular ones from the base orthography (á ã é í ó õ and ú) badly recognized from a scan.

    The characters ā ē ī ō ū are also never used, but they could become "frequent" if you take a book about the Latin language, where the macron is used to mark long vowels. Again, a clear clue that a scanning procedure was used.

    From the list, ª and º are truly frequent, while ñ is somewhat frequent due to cultural proximity with Spanish-speaking countries. è ò ù were used in the old orthography, and ö ü appear in German words, occasionally present due to German immigration in the 1800s. Some cultural proximity between France and Portugal could explain the ÿ. So we have 9 correct entries out of 29 characters.

    I can see similar problems in other languages. Spanish, for example:
    ª à ă â å ä ã ā æ ç è ĕ ê ë ē ì ĭ î ï ī º ò ŏ ô ö ø ō œ ù ŭ û ū ý ÿ

    From these, ă å ä ā ĕ ë ĭ î ï ī ò ŏ ō ù ŭ û ū are extremely rare or never used in Spanish, so we have a clear pattern of bad OCR here. By the way, ĭ seems to be a favorite of CLDR, since it appears in every language I checked.

    With so many errors, I suppose no human reviewed the lists I checked (Portuguese, Spanish, Galician, Quechua, French, Italian, and Catalan).
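
    An audit like this is easy to make repeatable. Below is a minimal sketch that simply diffs the CLDR auxiliary list quoted above against a hand-vetted list; both lists are taken from this post and are illustrative, not authoritative.

    ```python
    # Minimal sketch: flag auxiliary characters that a human reviewer has
    # not vetted. Both lists come from the post above, not from CLDR docs.
    cldr_auxiliary = set("ªăåäāæèĕëēìĭîïīñºŏöøōœùŭûüūÿ")  # CLDR pt auxiliary
    vetted = set("ªºñèòùöüÿ")  # entries the post judges plausible

    print("needs review:", " ".join(sorted(cldr_auxiliary - vetted)))
    print("vetted but absent from CLDR:", " ".join(sorted(vetted - cldr_auxiliary)))
    ```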
  • Adam Jagosz Posts: 689
    edited July 2021
    I’m sorry, but the version of Noto Sans on this forum doesn’t come with combining mark support. We should know better than to grab the stripped-down webfont that Google feeds to the naive masses. Shame!

    @Ori Ben-Dor You can check your font’s language coverage (limited to LGC) using Bulletproof (disclaimer: I’m the author). The data (including the character sets) was cherry-picked by my humble self, checked against various sources (mostly Wikipedia) in a rather undocumented way, and collected in an npm package, language-data.
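
    For readers who want to script such a check themselves, here is a minimal fontTools sketch; the font path and the Polish character sample are illustrative, and a real checker (like Bulletproof) does considerably more, e.g. uppercase forms and combining marks.

    ```python
    # Minimal sketch: report which characters of a language sample have
    # no cmap entry in a font. The path and sample are illustrative.
    from fontTools.ttLib import TTFont

    def missing_chars(font_path, chars):
        """Return sample characters the font's cmap does not cover."""
        cmap = TTFont(font_path).getBestCmap()  # codepoint -> glyph name
        return sorted(c for c in set(chars) if ord(c) not in cmap)

    # Sample: the lowercase Polish alphabet.
    print(missing_chars("MyFont-Regular.otf", "aąbcćdeęfghijklłmnńoóprsśtuwyzźż"))
    ```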

  • With so much errors, I suppose no human reviewed the lists I did check …

    With so many errors a source becomes completely unreliable and quite irrelevant.
    It is nice to look at some of the references mentioned (and there are a few good ones), but for my own character sets I rely on my own research only. As simple as that.
  • Ori Ben-Dor Posts: 386
    @Igor Freiberger, I gave French just as an example; it illustrates the difficulty of extracting the information from Omniglot.
    @Adam Jagosz, Thanks for suggesting and creating Bulletproof!
    Thanks, everyone!
  • John Savard Posts: 1,126
    As it happens, on my web site (in response to a news item about the French government looking for a better computer keyboard that supported capital accented letters) I gave an illustration of a potential answer to their quest, based on my research into which glyphs the French language requires...


  • … I'd pay good money for a reliable, comprehensive, and well-documented source regarding language support.
    A good point, Jasper, actually. But I hesitate to embark on this, for a couple of reasons.
    It is one thing for me to compile an encoding and make some notes about it for my own work. But it is another thing to consider selling it, which would require including thorough documentation of every detail. Moreover, I’d have to expect that some day that knowledge finds its way into the copy&paste sphere, where everyone gets free access to it and the business case is gone. For that reason I do actually believe this sort of thing ought to be administered by some sort of public standards body.
    And: the matter is complex. For every language/orthography/country you have several potential tiers of requirements: a) basic recent; b) historic basic; c) historic advanced; d) scientific; e) &cª &cª. There are things like Pinyin or Arabic transliteration or traditional Irish … one needs to look at it, case by case, and think over what is needed and what is not.
    But a concise guide about it would be a nice thing, I agree.

  • Thomas Phinney Posts: 2,883
    edited July 2021
    One complication for any would-be source for language support is deciding on appropriate targets depending on the purpose of the character set definition.

    Even simplifying Andreas’ concerns, there is often a substantial difference between:
    1) the minimum set of characters (and possibly glyph variants) that should be present to declare that a given font supports language X
    2) all the characters (and potentially glyph variants as well) one should put in a font that one wants to support language X

    It is of course possible to have a database that distinguishes between the two, or even more levels, and even have additional info about what the uses are of those “extra” characters. But gosh, that is a lot of work. And the whole thing has a great deal of subjectivity to it.

    Not saying it is not worth doing; I tried documenting those two ends of it in my character set work for Extensis back in the day.
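
    A database like that could keep the levels apart explicitly. Below is a minimal sketch of such a data model; the French entries are illustrative only, not a vetted character set.

    ```python
    # Minimal sketch of a two-level language record, as discussed above.
    from dataclasses import dataclass, field

    @dataclass
    class LanguageCharset:
        language: str
        minimum: set                                # required to claim support
        extended: set = field(default_factory=set)  # recommended extras
        notes: dict = field(default_factory=dict)   # why each extra exists

        def level(self, font_chars):
            """Classify a font's character set against this record."""
            if not self.minimum <= font_chars:
                return "unsupported"
            return "full" if self.extended <= font_chars else "minimum"

    # Illustrative French data (accented lowercase only, for brevity).
    fr = LanguageCharset(
        language="fr",
        minimum=set("àâæçéèêëîïôœùûüÿ"),
        extended={"Ÿ"},
        notes={"Ÿ": "rare; proper nouns such as L’Haÿ-les-Roses"},
    )
    print(fr.level(set("àâæçéèêëîïôœùûüÿ")))  # -> "minimum"
    ```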
  • And what about an open source approach? Sharing the load would help a lot, I imagine... though someone would probably still have to oversee things.

    Some things are definitely subjective, but as long as the debate is visible it is still super helpful. 
  • James Puckett Posts: 1,992
    Open source would probably leave out underserved languages. An open source database would probably consist of data people add if they have time during a project. Since the underserved languages aren’t getting used in many projects, nobody would be getting paid to add to the database.
  • Craig Eliason Posts: 1,436
    I would think suggesting any needed corrections to Underware’s or Typotheque's or Rosetta's efforts would be more efficient than trying to build something new from scratch.
  • Alex Visi Posts: 185
    edited August 2021
    Open source would probably leave out underserved languages. An open source database would probably consist of data people add if they have time during a project. Since the underserved languages aren’t getting used in many projects, nobody would be getting paid to add to the database.
    The question is what the whole thing is all about: a practical database for type designers, or a database of all languages on the planet? Rarely used languages are not worth adding in the first place (by "worth" I mean the difference between costs and benefits). So I wouldn’t expect a private research project to cover them either. But for someone local, the costs are lower and the benefits higher than for Europeans or Americans doing the research.

    So I think it’s kind of reversed: open source lets locals add their languages, even if it makes no sense for Europeans and Americans.
  • There is also ScriptSource, https://scriptsource.org/, which offers a collaborative platform for documenting glyph usage in various writing systems, scripts, languages, language variants, etc. I think the navigation of the website exemplifies the complexity of such a platform. In practice, I tend to find little to no information for most things I am interested in, so I’m unsure a new platform would perform any better.
  • John Hudson Posts: 3,186
    Of current projects, I find Rosetta’s Hyperglot the most promising.
  • Someone should say this: providing glyphs for Unicode codepoints does not equate to providing a workable font for a given language system. 

    There are also locl features, yes? Anything else?
  • heh. Oh heck yes.

    Try looking at Microsoft’s spec for the Universal Shaping Engine, which along the way defines what it expects from fonts: https://docs.microsoft.com/en-us/typography/script-development/use

    It is relatively simple for Latin, Greek and Cyrillic. But once you get into “complex scripts” it is … well, complex.

    To be honest, this is the most intimidating way to look at this info. But that said, this stuff can get pretty darn complicated.
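
    As a coda to the locl question above: with fontTools you can at least list which OpenType layout features a font exposes. A minimal sketch follows (the font path is hypothetical); seeing "locl" in the output only tells you localized forms exist, not which languages they serve.

    ```python
    # Minimal sketch: list GSUB/GPOS feature tags in a font, e.g. to see
    # whether a 'locl' feature is present at all.
    from fontTools.ttLib import TTFont

    def layout_features(font_path):
        """Return the sorted feature tags found in GSUB and GPOS."""
        font = TTFont(font_path)
        tags = set()
        for name in ("GSUB", "GPOS"):
            if name in font and font[name].table.FeatureList is not None:
                tags |= {rec.FeatureTag
                         for rec in font[name].table.FeatureList.FeatureRecord}
        return sorted(tags)

    print(layout_features("MyFont-Regular.otf"))  # hypothetical path
    ```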