Where do I find which glyphs are required for a given language?

Nick Shinn · August 2021

John Hudson wrote:

langsys tags … regional preferences

Examples of this in Latin script are a more vertically inclined Polish kreska (acute) accent, lowered capital dieresis in German, and extra guillemet sidebearings in French.

These are quite rare and discretionary, not a standard.

Peter Constable · August 2021

John Hudson said:
But there is no documented de facto standard for most of the langsys tags...

Fair enough, though at least when there's a mapping to an ISO 639 identifier it's clear in what contexts it would be appropriate to apply. Also, while I don't know about details of tags that were registered very early on, when people have since requested that tags be registered, then I would assume that's because they have distinctions they want to make in fonts.

John Hudson · August 2021

Also, while I don't know about details of tags that were registered very early on, when people have since requested that tags be registered, then I would assume that's because they have distinctions they want to make in fonts.

I was pondering that when considering the requests that Bob Hallissy just submitted. I took a look at the Tamil fonts that are using some of the new langsys tags, and as far as I can tell from a quick GSUB table review, only two of the four new tags were using locl variants: the other two seemed to mimic the dflt processing.

Of course, though, once you have a model that treats language X as default processing in a font and languages Y and Z as having variant behaviour, you have created the possibility that someone may want to build a font that treats Y or Z as default processing and the others the variants. So unless a de facto standard implies always that a particular language is default, registering all the languages with behaviours that vary from each other becomes necessary.

John Savard · August 2021

Peter Constable said:

But there is no de facto standard.

Well, yes. But if enough font designers bring out fonts purporting to support these language tags, according to their own ideas of what that should entail, eventually a de facto standard for those tags would emerge.

That's what I was viewing as preferable to no attempted implementations at all, on the grounds that "a de facto standard is better than none". But in the absence of even a de facto standard, incompatible implementations are also better than no implementations, because a de facto standard cannot be born from silence, but it can emerge from chaos. (Of course, a real standard could emerge without preceding implementations, if there were interest enough for some suitable body to develop one, but that is precisely what seems to be absent.)

Helmut Wollmersdorfer · August 2021

To be fair, Unicode CLDR was first released on 2003-12-19. In the dark times before that, technicians (including me) did "something". CLDR allows locale tags based on BCP 47, which can specify language, region, script, further variants, and private extensions.

Thus, e.g. for a transcription of Yiddish into the German writing system following the JIVO standard, one can specify:

yi-Latn-x-jivo

or for current US/international YIVO

yi-Latn-x-yivo

For French of France in IPA

fr-FR-x-ipa

For Ancient Greek (polytonic is AFAIK a reserved tag) of French scholars, e.g.

grc-polytonic-x-sorbonne1833

For historical German I use, e.g.

de-x-1750-x-longs

And for the transliteration to the current alphabet (not orthography)

de-x-1750-x-rounds

For transcription to modern orthography, 1901 (1st spelling reform) and 1996 (2nd) are reserved tags:

de-1996-x-1750

But at the font level it's IMHO seldom necessary to specify it in such detail. Sure, German Fraktur will need special care for ligatures and accents. Maybe there are a few glyphs needing variants in Czech or Polish Fraktur.
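To make the structure of the tags above concrete, here is a minimal Python sketch that splits a BCP 47-style tag into language, script, region, variants, and private-use parts. The function name `split_tag` is my own, and the parser is deliberately loose: it ignores extlang and extension subtags and does not consult the IANA subtag registry, so it should not be read as a validator.

```python
import re


def split_tag(tag):
    """Loosely split a BCP 47-style tag; a sketch, not a validator."""
    subtags = tag.split("-")
    parts = {"language": subtags[0].lower(), "script": None,
             "region": None, "variants": [], "private": []}
    rest = subtags[1:]
    lowered = [s.lower() for s in rest]
    # Everything after the first "x" singleton is treated as private use.
    if "x" in lowered:
        cut = lowered.index("x")
        parts["private"] = lowered[cut + 1:]
        rest = rest[:cut]
    for sub in rest:
        # Script and region may only appear before any region or variant.
        no_later_part = parts["region"] is None and not parts["variants"]
        if (re.fullmatch(r"[A-Za-z]{4}", sub)
                and parts["script"] is None and no_later_part):
            parts["script"] = sub.title()
        elif re.fullmatch(r"[A-Za-z]{2}|[0-9]{3}", sub) and no_later_part:
            parts["region"] = sub.upper()
        elif re.fullmatch(r"[A-Za-z0-9]{5,8}|[0-9][A-Za-z0-9]{3}", sub):
            parts["variants"].append(sub.lower())
        else:
            raise ValueError(f"unexpected subtag {sub!r} in {tag!r}")
    return parts
```

Under this sketch, `split_tag("de-1996-x-1750")` reports `1996` as a variant and `1750` as private use. It raises `ValueError` for `grc-polytonic-x-sorbonne1833`, because BCP 47 limits subtags to eight characters; as far as I know the registered variant subtag for polytonic Greek is `polyton`.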

Peter Constable · August 2021

John Hudson said:

Of course, though, once you have a model that treats language X as default processing in a font and languages Y and Z as having variant behaviour, you have created the possibility that someone may want to build a font that treats Y or Z as default processing and the others the variants. So unless a de facto standard implies always that a particular language is default, registering all the languages with behaviours that vary from each other becomes necessary.

Indeed.

Use of langsys to effect substitution of glyphs is, effectively, nearly equivalent to creating separate fonts with distinct names and using font names to select which bundle of glyphs will be displayed. A designer could create a set of fonts with names like "Foo", "Foo Betta Kurumba", "Foo Irula", etc. (see Bob's request), or any number of additional fonts for specific languages. But then if someone comes along wanting to use one of the fonts for some other language written in Tamil script, they have to determine which, if any, of the provided fonts works. If they find that one is suitable, then they can just use it.

But with langsys tags that are selected automatically by layout software based on content metadata (e.g., html lang), they can only get the glyphs needed for that other language if (a) a langsys tag for that other language is added to the font, or (b) the content metadata lies about the actual language of the content.

Fortunately, adding an additional langsys tag to a font, perhaps with a corresponding 'locl' feature, does not add much data or bloat the file size.
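The selection behaviour described above can be sketched as a toy model in Python. Everything here is an assumption for illustration: the font is a plain dictionary rather than a real GSUB table, the lookup names are invented, and `active_lookups` is a hypothetical function, not the algorithm of any particular layout engine. Only the script tag `latn` and the langsys tags `PLK `, `ROM `, `DEU ` are real OpenType tags.

```python
# Assumed mapping from the primary language subtag of the content metadata
# (e.g. html lang) to an OpenType language system tag. Real engines carry a
# much larger table.
LANG_TO_LANGSYS = {"pl": "PLK ", "ro": "ROM ", "de": "DEU "}

# Hypothetical font: script tag -> langsys tag -> lookups that get applied.
FOO_FONT = {
    "latn": {
        "dflt": ["liga_basic"],
        "PLK ": ["liga_basic", "locl_kreska"],
    }
}


def active_lookups(font, script, lang):
    """Return (langsys tag used, lookups) for content tagged with `lang`."""
    langsys_table = font.get(script, {})
    primary = lang.split("-")[0].lower()
    wanted = LANG_TO_LANGSYS.get(primary)
    if wanted in langsys_table:
        return wanted, langsys_table[wanted]
    # No matching langsys in the font: fall back to default processing.
    return "dflt", langsys_table.get("dflt", [])
```

In this model, content tagged `pl` gets the Polish lookups, while content tagged `ro` or `cs` silently falls back to `dflt`. The only ways to reach `locl_kreska` for another language are the two Peter lists: add a langsys entry for it to the font, or tag the content as `pl`.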

John Hudson · August 2021

But with langsys tags that are selected automatically by layout software based on content metadata (e.g., html lang), they can only get the glyphs needed for that other language if (a) a langsys tag for that other language is added to the font, or (b) the content metadata lies about the actual language of the content.

Even if a langsys tag is included in the font with appropriate glyph substitution or other specific behaviour, the whole mechanism relies on too many underspecified things happening in software, such that even if the correct glyphs and shaping for a non-default language are displayed in one place, there is no guarantee that they will elsewhere on the system, or in the same apps on other systems. And when text is copied and pasted between software, language tagging is liable to be lost. It is a fragile mechanism, which is why in recent years I have favoured making separate, language-specific fonts.

Johannes Neumeier · January 2022

@Igor Freiberger sorry to excavate this topic, but if you would be open to sharing some of your collected data, I’d be very keen to compare it, particularly the orthography data you have gathered, with what we have in Hyperglot. We have been quite diligent in distinguishing between "related glyphs" that some sources list (possibly used in a language for loan words, or added for historical or technical reasons) and actually verified orthographies with their minimal required charsets. It would be good to see if your data can unearth some things we missed, or confirm how complete our set is.
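As a rough illustration of what checking a "minimal required charset" means, here is a small Python sketch. It is not Hyperglot's API or data: `missing_chars` is a hypothetical helper, the German lowercase set is a hand-typed example, and the font's character map is modelled as a plain set of code points.

```python
def missing_chars(required, cmap_codepoints):
    """Return the required characters that the font's cmap does not cover."""
    return sorted(ch for ch in set(required) if ord(ch) not in cmap_codepoints)


# Illustrative charset only (German lowercase), not taken from any database.
GERMAN_LOWER = "abcdefghijklmnopqrstuvwxyzäöüß"

# Hypothetical font coverage: basic Latin lowercase plus the three umlauts.
FONT_CMAP = {ord(ch) for ch in "abcdefghijklmnopqrstuvwxyzäöü"}
```

With these inputs, `missing_chars(GERMAN_LOWER, FONT_CMAP)` reports the one uncovered character, ß. A check like this says nothing about shaping or localized forms, which is the gap discussed in the rest of this post.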

While language data and definition of language, orthographies, language use, etc. are very hard problems as well, IMO most difficulty with that bit of the equation is one of defining the framing. And that always carries bias no matter how you wrangle the data.

I agree with several other commenters that the biggest gap in documentation is in the not so easily quantifiable areas of type design: shaping, localization, font implementation. So far one of the guiding principles in Hyperglot has been to remain font technology agnostic (ironically, while probing specific fonts for support). If you are looking at shaping behaviors (and were to somehow quantify what is and what is not needed or adequate) you are still restrained by the envelope of e.g. OpenType and what you can detect with it, and what can be implemented with it.

That said, we would love to include more data of that sort, particularly where it is crucial to the writing system. In that regard, I consider restriction to Unicode the lesser of two evils, but akin to shaping rules implemented in OpenType, this would IMO be some kind of "Unicode plus do X to it" formulation. With or without OpenType, I'd be curious to hear how type designers think of this problem, and if there is something in the type design "lens" of looking at this problem that could be used to synthesize a more general formulation.
