Where do I find which glyphs are required for a given language?


Comments

  • This is written from the perspective of an engine that handles a font and does text shaping, but it is a pretty good quick read about what kinds of things might be going on: https://harfbuzz.github.io/why-do-i-need-a-shaping-engine.html
  • @Ray Larabie: I wouldn’t say your approach went in the wrong direction. Establishing certain categories for characters and then assigning (flagging) those categories to characters is the right way, IMHO.
    Leaving the simpler requirements of display fonts aside for now, I envision a system for sorting things out which starts much the same way as you describe.

    1. The horizontal dimension of coverage: scripts and languages.
    Determines whether you support Latin, Greek, math notation, alchemy or chess notation. Or IPA, UPA or airport pictographs. Also defines, within one of these chapters, whether you’re going to support (current) Sami, Vietnamese or Guaraní.
    2. The vertical dimension of coverage: defines how much in depth the support of the given chapter will be:
    • a) basic current
    • b) advanced current/typographic (special characters like Dutch ij, ligatures, currency signs)
    • c) basic historic (e.g. Greenlandic kra ĸ [not Icelandic k!], long ſ)
    • d) advanced historic (e.g. polytonic Greek, medieval Latin)
    • e) obscure and obsolete (a-ring-acute, L-dot, Serbo-Croatian digraphs, the drachma sign).
    3. The choice of parts built under 1. and 2. forms a certain encoding scheme. One can choose either to form a predominant horizontal coverage (e.g. broad basic current Latin languages coverage) or to put emphasis on deeper support for something (e.g. Latin linguistic, biblical Hebrew, ancient Greek alphanumerics).
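    A minimal sketch, in Python, of how such two-dimensional flagging could be represented. The category names and character assignments are mine, purely illustrative, not a proposed standard:

    from dataclasses import dataclass

    # Vertical depth levels, ordered from shallow to deep.
    DEPTHS = ["basic-current", "advanced-current", "basic-historic",
              "advanced-historic", "obscure-obsolete"]

    @dataclass(frozen=True)
    class CharFlag:
        char: str      # the character itself
        chapter: str   # horizontal category: script or notation system
        depth: str     # vertical category: one of DEPTHS

    FLAGS = [
        CharFlag("a", "latin", "basic-current"),
        CharFlag("ĳ", "latin", "advanced-current"),    # Dutch ij
        CharFlag("ĸ", "latin", "basic-historic"),      # Greenlandic kra
        CharFlag("ſ", "latin", "basic-historic"),      # long s
        CharFlag("ǻ", "latin", "obscure-obsolete"),    # a-ring-acute
    ]

    def charset(chapters, max_depth):
        """All flagged characters in the chosen chapters, down to max_depth."""
        cutoff = DEPTHS.index(max_depth)
        return {f.char for f in FLAGS
                if f.chapter in chapters and DEPTHS.index(f.depth) <= cutoff}

    print(charset({"latin"}, "advanced-current"))   # {'a', 'ĳ'}

    The choice of chapters and cutoff depth then directly yields one of the encoding schemes described under 3.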

    Two further remarks.

    I’d not second that every serious text face needs to support IPA characters. When you do this, which has merits of course, you inevitably run into the question of also supporting Uralic or other advanced phonetics (vertical!). Not every typeface has to cover, per se, the special needs of a particular sort of scientific literature.

    Under 8. you summarize “hobby/fictional/ancient”. – I disagree here. OK, one may put Klingon, Esperanto or Tengwar under (hobbyist/fictional), why not. But ancient is something completely different. Every historic (dead) writing system (Egyptian, Imperial Aramaic, the Phaistos Disc – you name it) is just as serious a script as Latin or Arabic. In historical studies and editorial work these scripts are daily business. Choosing or not choosing one of them clearly falls under 1., the horizontal dimension, in my opinion.

  • notdef
    notdef Posts: 168
    edited August 2021
    Great work, Igor.

    Ideas:
    • If orthographies have a starting date/ending date, you could filter for the latest active (which might be multiple). You can also target historical texts (see the sketch after this list).
    • If you allow for uploading/tagging images and files, you can include discussions of particular design requirements and document regional variants of letterforms. If images are also dated, you can now track trends.* The macOS Finder offers a good model for tagging files.
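    A minimal sketch of that date filtering; the record layout and the sample data are assumptions for illustration:

    from typing import NamedTuple, Optional

    class Orthography(NamedTuple):
        language: str
        name: str
        start: Optional[int]   # year of introduction, None if unknown
        end: Optional[int]     # year of abandonment, None if still active

    RECORDS = [
        Orthography("az", "Cyrillic", 1939, 1991),
        Orthography("az", "Latin", 1991, None),
        Orthography("de", "pre-reform", None, 1996),
        Orthography("de", "reformed", 1996, None),
    ]

    def active_in(records, year):
        """Orthographies in use in a given year; several may match at once."""
        return [o for o in records
                if (o.start is None or o.start <= year)
                and (o.end is None or o.end >= year)]

    print(active_in(RECORDS, 2021))   # the latest active orthographies
    print(active_in(RECORDS, 1950))   # targets historical texts instead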

  •    8. Hobby/fictional/ancient: Lord of the Rings, Esperanto, hieroglyphics.
    But Esperanto is not a fictional language. This is surely a mistake.
  • Ray Larabie
    Ray Larabie Posts: 1,431
    edited August 2021
    @Toby Lebarre Not everyone will agree. I categorize Esperanto as non-fictional but primarily for hobbyists, or maybe category 2.

    @Andreas Stötzner The difference between historical and ancient is that historical characters might appear in existing electronic documents that may still require support. There are probably no old websites written in hieroglyphics. That’s more of a specialty use than legacy support, and I think it would be more useful if classified separately. It’s not about how serious a script might be but about the type of font that would require it. A government website or a Gutenberg-style book repository might want to hang on to historical support to display old documents while never needing to include hieroglyphics. I guess a line could be drawn between recently historical and not-recently historical, and maybe the latter could be its own category.
    I’d not second that every serious text face needs to support IPA characters.
    Agreed. The whole point of this is to be able to categorize the characters and let the type designer or font subsetter decide.
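    For the subsetter side, a sketch with fontTools; the font file name and the character set are placeholders, and the set would come out of whatever categorization is agreed on:

    from fontTools import subset

    options = subset.Options()
    options.layout_features = ["*"]          # keep all OpenType layout features

    font = subset.load_font("MyFont.ttf", options)   # placeholder file name

    keep = {ord(c) for c in "abcdefghijklmnopqrstuvwxyzĸſ"}  # example category
    subsetter = subset.Subsetter(options)
    subsetter.populate(unicodes=keep)
    subsetter.subset(font)

    subset.save_font(font, "MyFont-subset.ttf", options)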
  • John Savard
    John Savard Posts: 1,126
    edited August 2021
    Since books are printed in Esperanto, despite the fact that it is not the first language of the people of any natural linguistic community, it might be considered as important for a text typeface as the characters of many lesser-known natural languages. But as Esperanto is a constructed language, and many other conlangs are in your category 8, perhaps its description could simply be reworded.
    A serious text typeface should support IPA.
    With this, I disagree for one specific reason. Such books as I've seen which make use of the International Phonetic Alphabet tend to put words expressed in that alphabet in a distinctive typeface, and often in bold face. So there doesn't seem to be an issue with the typeface used for the body of a document not including IPA characters, as long as one that does is also available.
    Of course, one could also say that this was an artifact of the lack of IPA support in lead type.
  • Please be careful when using resources based on CLDR, which uses automated OCR to identify all the characters of a given language
    Wherever did you get that idea? All of the data in CLDR gets vetted by human reviewers.
    I am not creative enough to invent such a strange idea. I read about that on the CLDR site...
    I can't find that. Can you provide a link?
    Anyway, auxiliary characters are still totally unreliable...
    You're describing the auxiliary sets. The CLDR documentation of exemplars says (emphasis added),
    There are five sets altogether: main, auxiliary, punctuation, numbers, and index. The main set should contain the minimal set required for users of the language, while the auxiliary exemplar set is designed to encompass additional characters: those non-native or historical characters that would customarily occur in common publications, dictionaries, and so on. Major style guidelines are good references for the auxiliary set. So, for example, if Irish newspapers and magazines would commonly have Danish names using å, for example, then it would be appropriate to include å in the auxiliary exemplar characters; just not in the main exemplar set.
    So, suppose OCR were used in some cases to get data for auxiliary sets. In your Portuguese etc. examples, are you asserting that the OCR process mis-identified characters, or that the samples on which OCR was done are not representative?

    In any case, the concept of auxiliary sets seems like it could be useful for font developers (albeit, "customarily occur in common publications" is vaguely defined). And if you think the data in CLDR for any given language is wrong, you can engage to get it changed.
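    For anyone who wants to inspect the data themselves, a rough sketch that reads exemplar sets straight out of a local checkout of the CLDR repository; the path is an assumption, and the UnicodeSet parsing here is deliberately naive (real sets can contain ranges and {multi-character strings}):

    import xml.etree.ElementTree as ET

    def exemplar_set(cldr_main_dir, locale, kind="main"):
        """Return one exemplar set of a locale as a Python set of strings."""
        tree = ET.parse(f"{cldr_main_dir}/{locale}.xml")
        for node in tree.iter("exemplarCharacters"):
            node_kind = node.get("type") or "main"   # the main set has no type
            if node_kind == kind:
                # Strip the surrounding brackets and split on whitespace.
                # Naive: does not handle ranges or {multi-character strings}.
                return set(node.text.strip().lstrip("[").rstrip("]").split())
        return set()

    # e.g., against a local checkout of https://github.com/unicode-org/cldr
    print(exemplar_set("cldr/common/main", "pt", "auxiliary"))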
  • Peter,

    As I said, there are signs of bad OCR, but of course I can’t say whether this is why a wrong character was included. Only people from CLDR could confirm this. And the pages have changed over the years; many links in previous versions are dead.

    I think that auxiliary sets with mistakes can rarely be useful. The designer will add characters beyond the font’s scope and maybe omit others that are necessary. Wrong information is hardly a good thing.

    I checked additional languages and, in some of them, the problem is worse: CLDR took the auxiliary set for English and just repeated it, subtracting the characters which are in the base alphabet. This is really bad.

    I can assure you that the auxiliary sets include wrong characters for these languages: Asturian*, Basque*, Breton*, Catalan, Galician, Latin, Luxembourgish, Occitan*, Portuguese, Quechua*, Romansh*, Sardinian, and Spanish.

    On the other hand, Aragonese, Corsican, and Ligurian seem to be right.

    * auxiliary sets identical to English, minus the base language alphabet.
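    The pattern flagged with * is easy to test mechanically. A sketch, reusing the naive exemplar_set reader from the earlier reply; the locale codes are my assumption for the languages named:

    # Assumes the exemplar_set() helper sketched earlier in this thread.
    for locale in ["ast", "eu", "br", "oc", "qu", "rm"]:
        aux = exemplar_set("cldr/common/main", locale, "auxiliary")
        derived = (exemplar_set("cldr/common/main", "en", "auxiliary")
                   - exemplar_set("cldr/common/main", locale, "main"))
        print(locale, aux == derived)   # True would confirm the copy-paste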

  • Thomas Phinney said:
    … it might be possible to get funding through this: https://imminent.translated.com/research-grants

    It looks interesting at first sight, but I looked again and again and could not find anything about who is actually in charge of that entity, who its head is, what sort of organisation it is, or where they are based.
    Do you have any insights about that?
  • I have a document from the time I freelanced for FontShop that is so complete that even Atlantis is listed :)



  • John Savard
    John Savard Posts: 1,126
    edited August 2021
    I have a document from the time I freelanced for FontShop that is so complete that even Atlantis is listed :)

    I suppose that's one way to catch plagiarism!
    Of course, a naïve individual might also think that Zypern is a fictitious place, right out of science fiction, but geographically and linguistically savvy people, like those who post here, would immediately recognize that this is how you say "Cyprus" in German.
    Speaking of science fiction, the Greek alphabet - without diacritics - could also be given as the script of Alpha Centauri!
    On what basis? While the movie Star Trek: First Contact has Zefram Cochrane developing the warp drive on a post-apocalyptic Earth, according to the original series, he lived on a planet orbiting α Centauri which, like Talos IV, had been colonized by means of sublight travel prior to the invention of warp drive. Inspired by this, the (unofficial, fan-created) Star Trek Technical Manual, after showing Microgramma as the official lettering style for ships like the Enterprise, showed the Greek alphabet in that style as what was to be used for ships from α Centauri.




  • Thomas Phinney said:
    … it might be possible to get funding through this: https://imminent.translated.com/research-grants

    It looks interesting at first sight, but I looked again and again and could not find anything about who is actually in charge of that entity, who its head is, what sort of organisation it is, or where they are based.
    Do you have any insights about that?

    I in no way vouch for them, but I gather Imminent is the research arm of Translated, and both are based in Italy.

    More on Imminent at these two links, including a list of key personnel (all with Italian names, I think) at the second link:
    https://imminent.translated.com/about-imminent
    https://imminent.translated.com/imminent-annual-report-2021

    Parent company Translated:
    https://translated.com/about-us
  • I have seen these pages already and still: no addresses, no head, no ‘beef’. The persons under ‘fellows’ seem to be real, but one doesn’t see who the actual contractor is.
    – Hey, they’re asking for your best ideas for ‘half of the price’ upfront.

    Has anyone ever heard about them before?
  • John Savard
    John Savard Posts: 1,126
    I can assure you that the auxiliary sets include wrong characters for these languages: Asturian*, Basque*, Breton*, Catalan, Galician, Latin, Luxembourgish, Occitan*, Portuguese, Quechua*, Romansh*, Sardinian, and Spanish.

    According to Wikipedia, at least, Basque is not terribly demanding: just ñ is required (although after some vowels, it is optional), but sometimes ç and ü are also used. However, it is noted that in the form of the Basque alphabet proposed by Sabino Arana, ll and rr were replaced by accented forms of l and r: ĺ and ŕ. Well, they're still in Unicode.

    So one could suggest one, three, or five diacritical characters for Basque and be correct.
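    Spelled out, the three defensible answers look like this; the tier names are mine, and the contents follow the Wikipedia summary above:

    BASQUE_DIACRITICS = {
        "minimal":  {"ñ"},                        # one character
        "extended": {"ñ", "ç", "ü"},              # three characters
        "arana":    {"ñ", "ç", "ü", "ĺ", "ŕ"},    # five, with Arana's ĺ and ŕ
    }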


  • Thank you, Simon Cozens, for the useful comments. Your note that "Urdu should be set in Nastaliq style" is fine as a general statement, but it is not relevant in the context of this discussion. Urdu is actually set in various styles, today as in the past, and not exclusively in Nastaliq. The selected style should not be a consideration when deciding language support.

    • While on the subject of the meta table and language support, having to handle the difference between OpenType script/language tags, BCP 47 script/language tags and ISO639-3 language tags is one truly horrible aspect of this discussion. You just have to know that TRK and tur and tr are the same language, but ROM is actually ron. (And possibly also ro? I'm not sure.) Burn it with fire. Someone, probably me, should write a little Python routine to convert between the three.
    Had the same problem with names of language resources in my collection of corpora: trk versus tr versus tur. You never can be sure whether the authors used a standard or just invented a name. tr and tur are the BCP 47/ISO codes for the Turkish language; trk is the ISO 639-5 code for the family of Turkic languages. See https://iso639-3.sil.org/code/trk

    ro, rom, and rm need extra caution. The Romanian language has BCP 47 ro and ISO 639-3 ron, but one may also encounter rum, the ISO 639-2 bibliographic code. rom is the macrolanguage Romani, not to be confused with Roman (~ancient Latin), Romance, Romanesco, Romang, or Romansh. And mo and mol for Moldavian are now deprecated; it's now ro-MD, but mo is still in use.

    It's even more confusing with historical names or ones used by linguistic scholars. Old High German was always abbreviated OHG, or AHD (Alt-Hoch-Deutsch) in German; the ISO 639-3 code is now goh. Many abbreviations are forgotten, like 'carn.' or 'carniol.' used in a Latin book published around 1750: that is proto-Slovenian, and Slovenia didn't exist before the end of WW I.

    It's tedious, and there is no way around converting them with scripts and lookup tables.
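    A minimal sketch of that lookup-table routine in Python; the table is a tiny hand-filled excerpt and would in practice be generated from the OT langsys registry plus the IANA and ISO 639 tables:

    TAGS = [
        # (OpenType langsys, BCP 47, ISO 639-3)
        ("TRK", "tr", "tur"),   # Turkish
        ("ROM", "ro", "ron"),   # Romanian (not rom, which is Romani)
        ("DEU", "de", "deu"),   # German
    ]

    def normalize(tag):
        """Map any of the three tag flavours to the full table row."""
        for row in TAGS:
            if tag in row:
                return row
        raise KeyError(f"unknown tag: {tag}")

    print(normalize("tur"))   # ('TRK', 'tr', 'tur')
    print(normalize("ROM"))   # ('ROM', 'ro', 'ron')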
  • John Savard
    John Savard Posts: 1,126
    I was not aware that Moldavian was even related to Romanian; my understanding was that Romanian is a Romance language with a vocabulary largely overlain by Slavic words, whereas Moldavian was a genuine Slavic language. However, I could be completely wrong, as I am not that knowledgeable about the subject. But perhaps the ro-MD code is strictly based on political boundaries, and not linguistics.
  • @John Savard AFAIK Moldavian is a Romanian dialect which developed in isolation for political reasons, kept some old vocabulary, and adopted Slavic words under Russian influence.

    When I compile wordlists for both languages from corpora of 1M well-formed sentences each, the most frequent words are the same.
  • John Hudson
    John Hudson Posts: 3,186
    edited August 2021
    You never can be sure if the authors used a standard or just invented a name.
    I can reliably attest that the authors of the OpenType Layout script and language system tags just invented a name. Because I was the person who invented many of the early ones.

    At the time (late 1990s), I asked the project manager at Microsoft with whom I was working if we should use an existing standard for language tagging when assigning new langsys tags, noting that some of those already assigned did not seem to conform to any standard I was aware of. The answer was no, since OTL language system tags were intended to capture writing conventions that do not necessarily map cleanly to languages.*

    Since those days, implementation of OTL langsys in software has tended to shift interpretation of those tags closer to that of language tagging—in recognition of which, ISO 639 codes were added as informative mappings in the OTL langsys registry—and I now think perhaps we would have been better off basing the tags on ISO standards in some way. Over the years, @Peter Constable has several times suggested adding a mechanism to OTL that would enable ISO language tags to be used directly in addition to or in place of OTL langsys tags.

    There remain, however, langsys tags such as IPPH that implement the original intent, i.e. capturing a writing convention—in that case IPA phonetic notation—that does not map to any specific language, and which reveal the problem in assuming that a langsys tag can be treated as a language tag. There are also several langsys tags that map to multiple ISO 639 codes.
    _____

    * My favourite example of the original intent of OTL langsys tags is the hypothetical pair

    <grek><FRA>
    <grek><DEU>

    which would enable a font to differentiate conventions for writing Greek in French and German academia. As well as illustrating the intent of the langsys tags, this example also serves as a reminder that a langsys tag is always used in combination with a script tag, and that it is the pairing that needs to be interpreted, not just the langsys tag itself.
  • Simon Cozens
    Simon Cozens Posts: 740
    edited August 2021
    John, your description of langsys is at variance with what the spec says about language systems. (Or, at best, the spec is unclear.) It says:

    language system may modify the functions or appearance of glyphs in a script to represent a particular language.
    (my emphasis)

    So in your grek/FRA example, the grek/FRA combination would apply specifically to text written in the Greek script but in the French language (loan words, maybe) - which is the way most people understand how the script/langsys combination works.

    But you seem to be talking about it in your academia example in quite a different way: text in Greek script, within a broader environment of the French language - say, a document whose base language is French.

    You might be right, or the spec might be right. It's hard to tell because the spec doesn't clearly define what it means for glyphs to "represent a language"... which is a funny expression, now that I come to think of it.
  • At risk of digressing further off topic...

    I believe John is correct regarding the original intent: <grek><FRA> being Greek-script text in the context of a French-language document. (I seem to recall hearing that at some point directly from the original architect, but can't say for certain.)

    Software and document markup conventions haven't developed in a way that can readily support that model, however. And it competes with a different requirement, which is to select language-specific glyphs (e.g., Serbian italic forms).

    If there is Greek-script text within a French document, current thinking about best practice would be that each run of text is marked up to indicate its language (e.g., xml:lang). So, if the Greek-script text is actually Greek language, you end up with something like

    <body lang="fr">...<span lang="gr">...</span> ...

    If the Greek-script text were French transcribed using Greek script,

    <body lang="fr"> ... <span>...</span> ...

    Or if the Greek-script text is some technical notation (not a human language),

    <body lang="fr">...<span lang="zxx">...</span> ...

    Then there's the software implementation: the easiest thing for software to do would be to select a language system tag for a run based on the language tagging of that run. That accommodates the user requirement for selecting language-specific glyphs (e.g., Serbian italic forms). Many browsers do that, but there are some very popular text-layout applications that still do not.
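    A sketch of that easiest behaviour; the mapping is a tiny hand-made excerpt, and falling back to dflt mirrors what shaping engines commonly do when no langsys matches:

    BCP47_TO_LANGSYS = {
        "sr": "SRB ",   # Serbian: language-specific italic forms
        "tr": "TRK ",   # Turkish: dotless-i casing behaviour
        "ro": "ROM ",
    }

    def langsys_for_run(run_lang_tag):
        """Pick an OTL language system tag from a run's BCP 47 language tag."""
        primary = run_lang_tag.split("-")[0].lower()   # "ro-MD" -> "ro"
        return BCP47_TO_LANGSYS.get(primary, "dflt")

    print(langsys_for_run("sr-Cyrl"))   # 'SRB '
    print(langsys_for_run("fr"))        # 'dflt': nothing registered for French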

    But now ask app developers to apply a language system tag based on the language of the containing run, or on a document's primary language. There aren't clear heuristics that could be used to figure out reliably what the correct choice is, especially given the competing requirement for language-specific forms. Moreover, some of the concepts may not fit at all in some contexts: e.g., what should be considered the primary language of a diglot publication? I think the only reliable thing would be for users to select the language system directly. But good luck getting many app developers to add that in a way that will be understandable to users!

    I think treating the language system tag as a means for selecting language-specific forms, which is clearly needed, is the only interpretation that could have succeeded.

    My biggest concern with some of the language system tags that have been registered is that it's unclear what they were supposed to mean. In some cases, John might have records from 25 years ago as to what was intended; but the original designers didn't consider the need to document what the intent is for registered tags.

    For example, 'BCR ' = "Bible Cree", or 'MOR ' = "Moroccan": IMO these are unusable (except as private-use tags in a closed environment) because there's no documentation of what they are supposed to mean. So, no font developer knows when these tags might be appropriate in font data (being consistent with what other font developers are doing and what users expect), and no app developer would know when to apply them.
  • John Hudson
    John Hudson Posts: 3,186
    I believe John is correct regarding the original intent: <grek><FRA> being Greek-script text in the context of a French-language document. (I seem to recall hearing that at some point directly from the original architect, but can't say for certain.)

    Yes, it is how Eliyezer explained it to me.

    I think there was always an ambiguity in the langsys concept, in that langsys can—and perhaps in most cases does—correspond to writing convention norms or preferences that map to language in the sense that the term is relevant to things like spelling and grammar checkers, hyphenation, etc. But those are all character-level functions, and OTL operates in glyph space, where what matters is having mechanisms to provide users with the appropriate visual forms in a given text, which might be determined by language, or partially determined by language, but also by other factors of content and context.

    That is why I have always argued that there need to be mechanisms to separate langsys from language tagging, even if only exceptionally. I was able to convince the CSS working group to do this at one stage, but I think they might have rolled it back.

    'MOR ' = "Moroccan": IMO these are unusable (except as private-use tags in a closed environment) because there's no documentation of what they are supposed to mean.

    As I recall—which is to say, probably not entirely accurately, because it was a long time ago—, Paul Nelson registered the MOR tag because of some variant shaping for an Arabic letter in Moroccan use that wasn’t captured by Unicode at the time, and this was implemented in early versions of the Arabic Typesetting font. But then Unicode added the letter with variant shaping as a disunified character, so the MOR langsys branch was removed from Arabic Typesetting.

  • John Savard
    John Savard Posts: 1,126
    For example, 'BCR ' = "Bible Cree", or 'MOR ' = "Moroccan": IMO these are unusable (except as private-use tags in a closed environment) because there's no documentation of what they are supposed to mean. So, no font developer knows when these tags might be appropriate in font data (being consistent with what other font developers are doing and what users expect), and no app developer would know when to apply them.

    While the absence of documentation is unfortunate, Morocco is a real place, and people there speak the Arabic language and write it, perhaps with certain unique conventions of their own. So, if one makes one's font follow those conventions, as far as one can research them, when the language is described as Moroccan, that should work.
    The Cree syllabary was originally developed by missionaries intent on presenting the Christian religion to the Cree Indians, and so presumably BCR indicates that the conventions of early printed materials by these missionaries are to be followed.
    After various font makers do research, and present fonts embodying it, eventually perhaps a standard with documentation will arise, which fonts can be revised to follow. If the tags are never used, though, no one will see a need to improve the situation. Even a de facto standard is better than none.

    ... Morocco is a real place, and people there speak the Arabic language and write it, perhaps with certain unique conventions of their own...
    The Cree syllabary was originally developed by missionaries intent on presenting the Christian religion to the Cree Indians, and so presumably BCR indicates that the conventions of early printed materials by these missionaries are to be followed.
    After various font makers do research, and present fonts embodying it, eventually perhaps a standard with documentation will arise... Even a de facto standard is better than none.

    But there is no de facto standard. Presumably some convention for Cree related to missionary work was assumed by whoever registered that tag, but who knows now what was intended. A font developer could research and arrive at some conclusions, but other font developers might arrive at different conclusions. Unless it's documented and that documentation is conventionally accepted as The Intent, no one can use it with confidence of interoperability.

    As for Moroccan, of course Morocco is a real place where Arabic is spoken and written. But also Tamazight is spoken and written in Tifinagh script, along with Tachelhit and other Berber languages; and there are multiple typographic conventions for how Tifinagh script is written. Which is intended by 'MOR '? Nobody knows.
  • Nick Shinn
    Nick Shinn Posts: 2,207
    Thomas said:

    It is relatively simple for Latin, Greek and Cyrillic.

    However, Uzbek, Kazakh and Turkmen are in a state of flux.
    There are a number of Cyrillic characters for those languages that I don’t think there’s much point in providing in a normal retail font.
    Better to include the Bulgarian variants.
  • John Hudson
    John Hudson Posts: 3,186
    But there is no de facto standard. Presumably some convention for Cree related to missionary work was assumed by whoever registered that tag, but who knows now what was intended. A font developer could research and arrive at some conclusions, but other font developers might arrive at different conclusions. Unless it's documented and that documentation is conventionally accepted as The Intent, no one can use it with confidence of interoperability.

    But there is no documented de facto standard for most of the langsys tags that have been registered; indeed, most of those langsys tags will probably never be used in any fonts because there is not, in fact, any difference in glyph shape or behaviour that needs to be distinguished from the dflt script. Most of the registered langsys tags exist as placeholders against the day when some font maker decides they want to implement some novel behaviour for langsys XXX, which may never happen.

    There is a relatively tiny number of langsys tags that have conventional implementation in fonts, mostly to resolve issues around Unicode’s encoding of characters with special behaviours as in ARA and TRK, or semi-deprecated unified encodings as in ROM/MOL, or regional preferences as in BGR, MKD, SRB, MAR.

    Script-level glyph distinctions are more common, i.e. variant shapes or behaviours associated with the locl feature within the dflt processing of a script, e.g. the variant forms of U+0304 COMBINING MACRON and U+0306 COMBINING BREVE that I just added to a Kannada font for use in prosodic notation.
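    For concreteness, a sketch of that kind of script-level locl substitution, compiled with fontTools feaLib; the font file and the .knda glyph names are hypothetical:

    from fontTools.ttLib import TTFont
    from fontTools.feaLib.builder import addOpenTypeFeaturesFromString

    FEA = """
    languagesystem DFLT dflt;
    languagesystem knd2 dflt;

    feature locl {
        script knd2;
        sub uni0304 by uni0304.knda;  # variant combining macron
        sub uni0306 by uni0306.knda;  # variant combining breve
    } locl;
    """

    font = TTFont("Kannada-Regular.ttf")          # hypothetical source font
    addOpenTypeFeaturesFromString(font, FEA)      # glyphs must exist in the font
    font.save("Kannada-Regular-locl.ttf")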