Where do I find which glyphs are required for a given language?

Thomas Phinney · August 2021

This is written from the perspective of an engine that handles a font and does text shaping, but it is a pretty good quick read about what kinds of things might be going on: https://harfbuzz.github.io/why-do-i-need-a-shaping-engine.html

Ray Larabie · August 2021

I was commissioned to explore a system to help type designers determine a glyph repertoire. The project fizzled out, but I spent a month researching and compiling. I spent a lot of that time going in the wrong direction but here's my conclusion.

A serious text typeface will require different characters than a zany display typeface. A display typeface with Pinyin support would be pointless since those characters are only used in textbooks. A serious text typeface should support IPA. Expecting designers to do their own research for every language will propagate errors. Also, it's difficult to juggle multiple goals. I made a signage font for endangered languages, and it was difficult to determine which characters were orthographic and which were in actual use by native speakers. If you've explored African languages, you already know how hard it to differentiate those characters. Unlike Unicode there can never be a single centralized database because there are political considerations (see flag 7). Not everyone is going to agree what constitutes a historical language or an obsolete character. Those flags will change as historical languages are be revived and others go from endangered to historical.

The simplest solution is a table with a list of Unicode values, flags, and language codes. The list would be ordered by Unicode value. I don't think adding a name value would be practical since that's already part of the Unicode table (shouldn't duplicate anything in that table), but a comment flag could make it more human readable. The supported languages for each character should be listed as codes and translated to the user's language in a separate table. English speakers will see French and French speakers will see Français. Language codes might be better as numeric codes as the table could get unwieldy for common characters such as O. Flags will determine the functions of the character. The flag doesn't describe what each character is, but rather, what type of font would require it. Characters can have multiple flags.

Scholarly: for use in textbooks. Current IPA for example.
Orthographic: for descriptions of languages used for the study of a language but not extensively used outside of that context. Pinyin is an example. Characters used for Latinized biblical transliteration.
Mathematical: beyond the typical plus, minus, multiply, divide etc. Scientific characters likely included.
Deprecated: Officially deprecated characters as well as some that probably shouldn't be included such as L dot. Characters that got added to Unicode but nobody ever used such as dratchma (20AF). I'd include the Greek semicolon as well.
Historical: No longer in current use such as Icelandic k. Sometimes it's hard to determine. When Wikipedia says there are 80 speakers and the last check was in 1990, I think it counts as historical. It's up to the makers of the table to determine where that line is drawn.
Obscure: Oddities such as A ring acute which are theoretically possible but almost never seen in use. DZ digraphs.
Endangered: Characters only used in languages that seem to be dwindling in use. This flag is the main reason that you can't have a centralized system like Unicode. There's no way everyone's going to agree on this flag. You set a threshold and go with it. If I'm making a typeface that's being commissioned for a mobile game, I need to know where to draw the line for language support. Conversely, if I'm making a typeface that supports indigenous languages, I want to ensure that these are included. It feels cruel but that's why we need different versions of this table.
Hobby/fictional/ancient: Lord of the Rings, Esperanto, hieroglyphics.

In type design applications, the user could check the boxes and generate a character set template and/or characters with diacritical marks. A website could use this table to display layers over a Unicode table. Unicode sites such as Unicode Symbol could use this to display information for individual characters. Autokerners could use this table to determine pair priority. Font vendors could use this to determine language coverage and help customers find specialty typefaces. As I said, this project fizzled out, but I figured I may as well share my conclusions since I already put some work into it. I don't have an example of this table. As I stated at the top, I had been going in the wrong direction and that direction was a clunky spreadsheet and I can't share that.

I picture it as something like:

0041 00000000 #A [en, fr, ge, it...]
20AF 00010000 #dratchma [gr]

Andreas Stötzner · August 2021

@Ray Larabie: I wouldn’t say your approach went in the wrong direction, basically. To establish certain categories for characters and then assign (flag) these categories to characters is the right way, i.m.h.o.

To leave the more simple requirements for display fonts aside for now, I envision a system for sorting things out which starts quite the same way as you describe.

1. The horizontal dimension of coverage: scripts and languages.
Determines wether you support Latin, Greek, Math notation, Alchemy or Chess notation. Or IPA, UPA or airport pictographs. Also defines within one of these chapters wether you’re going to support (current) Sami, Vietnamese or Guaranì.
2. The vertical dimension of coverage: defines how much in depth the support of the given chapter will be:

a) basic current
b) advanced current/typographic (special ch.s like dutch ĳ, ligatures, monetaries)
c) basic historic (e.g. Greenlandic kra ĸ [not Icelandic k!], long ſ )
d) advanced historic (e.g. polytonic Greek, medieval Latin)
e) obscure and obsolete (a-ring-acute, L-dot, serbo-croatian digraphs, Drachma).

3. The choice of parts built under 1. and 2. forms a certain encoding scheme. One can choose either to form a predominant horizontal coverage (e.g. broad basic current Latin languages coverage) or to put emphasis on deeper support for something (e.g. Latin linguistic, biblical Hebrew, ancient Greek alphanumerics).

Two further remarks.

I’d not second that every serious text face needs to support IPA characters. When you do this, which has merits of course, you inevitably run into the question of also supporting Uralic or other advanced phonetics (vertical!). Not every typeface has per se to cover the special needs of a particular sort of scientific literature.

Under 8. you summarize “hobby/fictional/ancient”. – Disagree here. Ok, one may put Klingon, Esperanto or Tengwar under (hobbyist/fictional), why not. But ancient is something completely different. Every historic (dead) writing system (Egyptian, Imperial Aramaic, Disc of Phaistos – you name it) is just as serious a script like Latin or Arabic is. In historic studies and editorial works these scripts are daily business. To choose or not to choose one of them clearly counts under 1.–horizontal, in my opinion.

Igor Freiberger · August 2021

[I apologize to the ones who already saw this in my Twitter days ago.]

This is the database I built. It was conceived for my own use, so it includes fields about if and when I would include a given script or language in my own fonts.

Image: https://us.v-cdn.net/5019405/uploads/editor/7t/vb2i2y4mtidc.png

It's a relational database with scripts x families x languages x orthographies x countries x speakers x Unicode blocks. The focus is exactly the question of this thread: glyphs required to a given language. But I decided to add some other basic information about each script and idiom.

Image: https://us.v-cdn.net/5019405/uploads/editor/bm/6hlyqgy3jl3c.png

It took me a while until I modeled the data taking orthographies as the nuclear information. As pointed by Andreas, the information is not unidimensional. A language can use more than one script, a script can embrace several languages, a country can have dozens of languages, a Unicode block can include characters from many languages, etc.

Image: https://us.v-cdn.net/5019405/uploads/editor/j9/7fcn4pnr4dm9.png

In this model, I can include several orthographies indicating when they were adopted. The orthography with no additional name is the current. It also allows to properly register and differentiate ancient or extinct orthographies.

Since I work with FontLab, I added the Generate field —the plain text one can use in FontLab's "Generate Glyphs" feature to quickly add support for this orthography.

By now, I included all languages registered and identified by macOS' FontBook and some random others. When the database reaches a more wide, mature condition, I will publish the lists of languages and orthographies for free use.

Image: https://us.v-cdn.net/5019405/uploads/editor/t4/dq0a81qo8o2i.png

As a final note, each orthography registered also indicates the sources of information. I am extremely grateful to all the people present on this small list (and I hope not to omit names), which is always growing due to the generosity of our niche community.

P.S.

Image: https://us.v-cdn.net/5019405/uploads/editor/gu/p6em4rgn8rgc.png

1. Greenlandic is already on the db and is an example of how good is to treat languages and orthographies as different entities.

2. The database was built with Ninox running locally on macOS. Contents could be exported as csv or spreadsheet. Ninox is not paying me for say that, unhappily.

notdef · August 2021

Great work, Igor.

Ideas:
• If orthographies have a starting date/ending date, you could filter for the latest active (which might be multiple). You can also target historical texts.
• If you allow for uploading/tagging images and files, you can include discussions of particular design requirements and document regional variants of letterforms. If images are also dated, you can now track trends.* The Mac OS Finder offers a good solution for tagging files.

Toby Lebarre · August 2021

Ray Larabie said:

8. Hobby/fictional/ancient: Lord of the Rings, Esperanto, hieroglyphics.

But Esperanto is not the fictional language. This is surely a mistake.

Ray Larabie · August 2021

Toby Lebarre Not everyone will agree. I categorize Esperanto as non-fictional but primarily for hobbyists or maybe category 2.

@Andreas Stötzner The difference between historical and ancient is that historical characters might appear in existing electronic documents that may still require support. There are probably no old websites that are written in hieroglyphics. That's more of a specialty use rather than legacy support which I think it would be more useful if classified separately. It's not about how serious it might be but the type of font that would require it. A government website or Gutenberg-style book repository site might want to hang on to historical support to display old documents while not ever needing to include hieroglyphics. I guess a line could be drawn at recently historical and not-recently historical and maybe it could be it its own category.

I’d not second that every serious text face needs to support IPA characters.

Agreed. The whole point of this is to be able to categorize the characters and let the type designer or font subsetter decide.

John Savard · August 2021

Since books are printed in Esperanto, despite the fact that it is not the first language of the people of any natural linguistic community, it might be considered as important for a text typeface as the characters of many lesser-known natural languages. But as Esperanto is a constructed language, and many other conlangs are in your category 8, perhaps its description could simply be reworded.

Ray Larabie said:

A serious text typeface should support IPA.

With this, I disagree for one specific reason. Such books as I've seen which make use of the International Phonetic Alphabet tend to put words expressed in that alphabet in a distinctive typeface, and often in bold face. So there doesn't seem to be an issue with the typeface used for the body of a document not including IPA characters, as long as one that does is also available.

Of course, one could also say that this was an artifact of the lack of IPA support in lead type.

Peter Constable · August 2021

Igor Freiberger said:

Peter Constable said:

Igor Freiberger said:

Please be careful when using resources based on CLDR, which uses automated OCR to identify all the characters of a given language

Wherever did you get that idea? All of the data in CLDR gets vetted by human reviewers.

I am not so creative to invent such a strange idea. I read about that in CLDR site...

I can't find that. Can you provide a link?

Anyway, auxiliary characters are still totally unreliable...

You're describing the auxiliary sets. The CLDR documentation of exemplars says (emphasis added),

There are five sets altogether: main, auxiliary, punctuation, numbers, and index. The main set should contain the minimal set required for users of the language, while the auxiliary exemplar set is designed to encompass additional characters: those non-native or historical characters that would customarily occur in common publications, dictionaries, and so on. Major style guidelines are good references for the auxiliary set. So, for example, if Irish newspapers and magazines would commonly have Danish names using å, for example, then it would be appropriate to include å in the auxiliary exemplar characters; just not in the main exemplar set.

So, suppose OCR were used in some cases to get data for auxiliary sets. In your Portuguese etc. examples, are you asserting that the OCR process mis-identified characters, or that the samples on which OCR was done is not representative?

In any case, the concept of auxiliary sets seems like it could be useful for font developers (albeit, "customarily occur in common publications" is vaguely defined). And if you think the data in CLDR for any given language is wrong, you can engage to get it changed.

Thomas Phinney · August 2021

Ray Larabie said:

I was commissioned to explore a system to help type designers determine a glyph repertoire. The project fizzled out, but I spent a month researching and compiling.

If @Ray Larabie or anyone else is interested in work in this area, it might be possible to get funding through this: https://imminent.translated.com/research-grants

I believe it would fall under their scope. There are 5 grants of €20,000 available, which should be enough to do something (albeit not everything!)....

Igor Freiberger · August 2021

Peter,

As I said, there are clues of bad OCR, but of course I can't assert if this is why a wrong character was included. Only people from CLDR could inform this. And the pages were changed along the years, many links are dead in previous versions.

I think that auxiliary sets with mistakes can rarely be useful. The designer will add characters beyond the font scope and maybe omit others necessary. Wrong information is hardly a good thing.

I checked additional languages and, in some of them, the problem is worse: CLDR took the auxiliary set for English and just repeated it, subtracting the characters which are in the base alphabet. This is really bad.

I can assure the auxiliary sets include wrong characters for these languages: Asturian*, Basque*, Breton*, Catalan, Galician, Latin, Luxembourgish, Occitan*, Portuguese, Quéchua*, Romansh*, Sardinian, and Spanish.

In other hand, Aragonian, Corsican, and Lugirian seem to be right.

* auxiliary sets identical to English, minus the base language alphabet.

Andreas Stötzner · August 2021

Thomas Phinney said:
… it might be possible to get funding through this: https://imminent.translated.com/research-grants

it looks interesting at first sight, but I looked again and again and could not find anything about who is actually in charge of that entity, who is the head, what sort of organisation it is or where they are based.

Do you have any insights about that?

Ramiro Espinoza · August 2021

I have a document from the times I freelanced for Fontshop that is so complete that even Atlantis is listed

Image: https://us.v-cdn.net/5019405/uploads/editor/3q/t7u2vame140j.png

John Savard · August 2021

Ramiro Espinoza said:

I have a document from the times I freelanced for Fontshop that is so complete that even Atlantis is listed

I suppose that's one way to catch plagiarism!

Of course, a naïve individual might also think that Zypern is a fictitious place, right out of science fiction, geographically and linguistically savvy people, like those who post here, would immediately recognize that this is how you say "Cyprus" in German.

Speaking of science fiction, the Greek alphabet - without diacritics - could also be given as the script of Alpha Centauri!

On what basis? While the movie Star Trek: First Contact has Zefrem Cochrane developing the warp drive on a post-apocalyptic Earth, according to the original series, he lived on a planet orbiting α Centauri which, like Talos IV, had been colonized by means of sublight travel prior to the invention of warp drive. Inspired by this, the (unofficial, fan-created) Star Trek Technical Manual, after showing Microgramma as the official lettering style for ships like the Enterprise, showed the Greek alphabet in that style as what was to be used for ships from α Centauri.

Thomas Phinney · August 2021

Andreas Stötzner said:

Thomas Phinney said:
… it might be possible to get funding through this: https://imminent.translated.com/research-grants

it looks interesting at first sight, but I looked again and again and could not find anything about who is actually in charge of that entity, who is the head, what sort of organisation it is or where they are based.
Do you have any insights about that?

I in no way vouch for them, but I gather Imminent is the research arm of Translated, and both are based in Italy.

More on Imminent at these two links, including a list of key personnel (all with Italian names, I think) at the second link:
https://imminent.translated.com/about-imminent
https://imminent.translated.com/imminent-annual-report-2021

Parent company Translated:
https://translated.com/about-us

Andreas Stötzner · August 2021

I have seen these pages already and still: no adresses, no head, no ‘beef’. The persons under ‘fellows’ seem to be real, but one doesn’t see who is the actual contractor.

– Hey, they’re asking for your best ideas for ‘half of the price’ upfront.

Has anyone ever heard about them before?

John Savard · August 2021

Igor Freiberger said:

I can assure the auxiliary sets include wrong characters for these languages: Asturian*, Basque*, Breton*, Catalan, Galician, Latin, Luxembourgish, Occitan*, Portuguese, Quéchua*, Romansh*, Sardinian, and Spanish.

According to Wikipedia, at least, Basque is not terribly demanding: just ñ is required (although after some vowels, it is optional), but sometimes ç and ü are also used. However, it is noted that in the form of the Basque alphabet proposed by Sabino Arana, ll and rr were replaced by accented forms of l and r: ĺ and ŕ. Well, they're still in Unicode.

So one could suggest one, three, or five diacritical characters for Basque and be correct.

Simon Cozens · August 2021

I'm working on a tool at the moment to analyse whether a font supports a given language ("why another one?" will be answered later), and it's extremely complicated and breaking my brain. I'm going to use this thread to dump down a few thoughts, because I figure those reading this thread are interested in the subject and its complexities.

I don't think the question "does a font support a language?" is actually well-defined. (Which is obviously a problem for what I am doing.) It's certainly not a yes/no question. For example: languages may have multiple orthographies in use, spanning different scripts.

I asked Hyperglot if Noto Sans Balinese supports Balinese, and the answer, amusingly, is "no". This is because Balinese is mostly written in the Latin script these days, not the Balinese script. Sure, if you turn on --include-all-orthographies, Hyperglot will grudgingly admit that Noto Sans Balinese supports Balinese written in the Balinese script.

Conversely, does a Latin-only font - with all the comma accents etc. - support Romanian or Moldovan? Well, not in the Pridnestrovian Moldavian Republic, because there you'd need to be using Cyrillic to write the language.

So I think we can't talk about languages in isolation; we have to ask whether a font has support for a particular script/language combination, possibly with regional variants. (Which of course is how the font's meta table encourages you to describe language support.)
While on the subject of the meta table and language support, having to handle the difference between OpenType script/language tags, BCP 47 script/language tags and ISO639-3 language tags is one truly horrible aspect of this discussion. You just have to know that TRK and tur and tr are the same language, but ROM is actually ron. (And possibly also ro? I'm not sure.) Burn it with fire. Someone, probably me, should write a little Python routine to convert between the three.
We also need to talk about levels of support. As mentioned, it's not a yes/no question. Having glyphs for the required character codepoints is one level. Handling the conventions of the script/language combination is another.

Simple example, Turkish: does a font "support" Turkish just because it has a İ (LATIN CAPITAL LETTER I WITH DOT ABOVE) glyph? It's a start. But if you have a small-caps feature you also need to have layout rules which handle the i/İ conversion. Whether you do that through ccmp or locl or smcp or whatever is up to you (which makes it a pain to analyse), but it has to be there.

As a more puzzling example, does Noto Naskh Arabic "support" Urdu? It does have the right glyph set (tick!) and it has the right shaping rules (tick!) so it can typeset texts in the Urdu language, but they won't look right, because Urdu should be set in Nastaliq style. So that's clearly another level of support.
So you can fill out the Unicode codepoints for Devanagari, say, but still not have a font that supports Hindi. Or even Devanagari. Correct shaping behaviour needs to be present. And there may be others user expectations which need to be fulfilled too - say, the presence and availability of certain expected conjuncts. Probably lots of other things too.

This is why I'm writing another tool - to try to report on what is really needed to support a script/language combination. Not just looking at codepoint coverage, but also checking shaping behaviour and user expectations. And not giving a yes/no answer, but a series of answers about what does and doesn't work.

It's... going to take a while.

Mamoun Sakkal · August 2021

Than you Simon Cozens for the useful comments. Your note about "Urdu should be set in Nastaliq style" is fine as a general statement but it is not relevant in the context of this discussion. Urdu is actually set in different styles today as well as in the past, and not exclusively in Nastaliq. The selected style should not be considered when deciding language support.

Helmut Wollmersdorfer · August 2021

Simon Cozens said:

While on the subject of the meta table and language support, having to handle the difference between OpenType script/language tags, BCP 47 script/language tags and ISO639-3 language tags is one truly horrible aspect of this discussion. You just have to know that TRK and tur and tr are the same language, but ROM is actually ron. (And possibly also ro? I'm not sure.) Burn it with fire. Someone, probably me, should write a little Python routine to convert between the three.

Had the same problem with names of language resources in my collection of corpora. trk versus tr versus tur. You never can be sure if the authors used a standard or just invented a name. tr and tur are BCP47/ISO code for the Turkish language. trk is ISO 639-5 for the language family of Turkic languages. See https://iso639-3.sil.org/code/trk.

ro, rom, rm needs extra caution. The language Romanian has BCP47/ISO ro resp. ron. But it can be mis-named rum. rom is the macro language Romani, not to be confused with Roman (~ancient Latin), Romance, Romanesco, Romang, or Romansh. And mo or mol for Moldavian are now deprecated. It's now ro-MD but mo is still in use.

It's even more confusing with historical names or ones used by linguistic scholars. Old High German always had the abbreviation OHG resp. AHD (Alt-Hoch-Deutsch). Now ISO is GOH. Many are forgotten like 'carn.' or 'carniol.' used as abbreviation in a Latin book published ~1750. It's proto-Slovenian and Slovenia didn't exist before the end of WW I.

It's boring and there is no way around converting them by script and lookup tables.

John Savard · August 2021

I was not aware that Moldavian was even related to Romanian; my understanding was that Romanian is a Romance language with a vocabulary largely overlain by Slavic words, whereas Moldavian was a genuine Slavic language. However, I could be completely wrong, as I am not that knowledgeable about the subject. But perhaps the ro-MD code is strictly based on political boundaries, and not linguistics.

Helmut Wollmersdorfer · August 2021

@John Savard AFAIK Moldavian is a Romanian dialect which developed isolated for political reasons, kept some old vocabulary and adopted Slavic words under Russian influence.

When I compile wordlists of both languages from corpora of 1 M proper sentences, the most frequent words are the same.

John Hudson · August 2021

You never can be sure if the authors used a standard or just invented a name.

I can reliably attest that the authors of the OpenType Layout script and language system tags just invented a name. Because I was the person who invented many of the early ones.

At the time (late 1990s), I asked the project manager at Microsoft with whom I was working if we should use an existing standard for language tagging when assigning new langsys tags, noting that some of those already assigned did not seem to conform to any standard I was aware of. The answer was no, since OTL language system tags were intended to capture writing conventions that do not necessarily map cleanly to languages.*

Since those days, implementation of OTL langsys in software has tended to shift interpretation of those tags closer to that of language tagging—in recognition of which, ISO 639 codes were added as informative mappings in the OTL langsys registry—, and I now think perhaps we would have been better off basing the tags on ISO standards in some way. Over the years, @Peter Constable has several times suggested adding a mechanism to OTL that would enable ISO language tags to be used directly in addition or in place of OTL langsys tags.

There remain, however, langsys tags such as IPPH that implement the original intent, i.e. capturing a writing convention—in that case IPA phonetic notation—that do not map to any specific language, and which reveal the problem in assuming that a langsys tag can be treated as a language tag. There are also several langsys tags that map to multiple ISO 639 codes
_____

* My favourite example of the original intent of OTL langsys tags are the hypothetical

<grek><FRA>
<grek><DEU>

which would enable a font to differentiate conventions for writing Greek in French and German academia. As well as illustrating the intent of the langsys tags, this example also serves as a reminder that a langsys tag is always used in combination with a script tag, and that it is the pairing that needs to be interpreted, not just the langsys tag itself.

Simon Cozens · August 2021

John, your description of langsys is at variance with what the spec says about language systems. (Or at best, the spec is unclear). It says:

A language system may modify the functions or appearance of glyphs in a script to represent a particular language.

(my emphasis)

So in your grek/FRA example, the grek/FRA combination would apply specifically to text written in the Greek script but in the French language (loan words, maybe) - which is the way most people understand how the script/langsys combination works.

But you seem to be talking about it in your academia example in quite a different way: text in Greek script, within a broader environment of the French language - say, a document whose base language is French.

You might be right, or the spec might be right. It's hard to tell because the spec doesn't define clearly means for glyphs to "represent a language"... which is a funny expression, now that I come to think of it.

Peter Constable · August 2021

At risk of digressing further off topic...

I believe John is correct regarding the original intent: <grek><FRA> being Greek-script text in the context of a French-language document. (I seem to recall hearing that at some point directly from the original architect, but can't say for certain.)

Software and document markup conventions haven't developed in a way that can readily support that model, however. And it competes with a different requirement, which is to select language-specific glyphs (e.g., Serbian italic forms).

If there is Greek-script text within a French document, current thinking about best practice would be that each run of text is marked up to indicate its language (e.g., xml:lang). So, if the Greek-script text is actually Greek language, you end up with something like

<body lang="fr">...... ...

If the Greek-script text were French transcribed using Greek script,

<body lang="fr"> ... ... ...

Of if the Greek-script text is some technical notation (not a human language),

<body lang="fr">...... ...

Then there's the software implementation: The easiest thing for software to do would be to select a language system tag for a run based on the language tagging of that run. That accommodates the user requirement for selecting language specific glyphs (e.g., Serbian italic forms). Many browsers are doing that, but there are some very popular text-layout applications that still do not.

But now ask app developers to apply a language system tag based on the language of the containing run, or a document primary language. There aren't clear heuristics that could be used to figure out reliably what is the correct choice, especially given the competing requirement for language-specific forms. Moreover, some of the concepts may not fit at all in some contexts: E.g., what's should be considered the document primary language of a diglot publication? I think the only reliable thing would be for users to select the language system directly. But good luck getting many app developers to add that in a way that will be understandable to users!

I think treating the language system tag as a means for selecting language-specific forms, which is clearly needed, is the only interpretation that could have succeeded.

My biggest concern with some of the language system tags that have been registered is that it's unclear what they were supposed to mean. In some cases, John might have records from 25 years ago as to what was intended; but the original designers didn't consider the need to document what the intent is for registered tags.

For example, 'BCR ' = "Bible Cree", or 'MOR ' = "Moroccan": IMO these are unusable (except as private-use tags in a closed environment) because there's no documentation of what they are supposed to mean. So, no font developer knows when these tags might be appropriate in font data (being consistent with what other font developers are doing and what users expect), and no app developer would know when to apply them.

John Hudson · August 2021

I believe John is correct regarding the original intent: <grek><FRA> being Greek-script text in the context of a French-language document. (I seem to recall hearing that at some point directly from the original architect, but can't say for certain.)

Yes, it is how Eliyezer explained it to me.

I think there was always an ambiguity in the langsys concept, in that langsys can—and perhaps in most cases does—correspond to writing convention norms or preferences that map to language in the sense that the term is relevant to things like spelling and grammar checkers, hyphenation, etc.. But those are all character level functions, and OTL operates in glyph space where what matters is having mechanisms to provide users with the appropriate visual forms in a given text, which might be determined by language, or partially determined by language, but also by other factors of content and context.

That is why I have always argued that there need to be mechanisms to separate langsys from language tagging, even if only exceptionally. I was able to convince the CSS working group to do this at one stage, but I think they might have rolled it back.

'MOR ' = "Moroccan": IMO these are unusable (except as private-use tags in a closed environment) because there's no documentation of what they are supposed to mean.

As I recall—which is to say, probably not entirely accurately, because it was a long time ago—, Paul Nelson registered the MOR tag because of some variant shaping for an Arabic letter in Moroccan use that wasn’t captured by Unicode at the time, and this was implemented in early versions of the Arabic Typesetting font. But then Unicode added the letter with variant shaping as a disunified character, so the MOR langsys branch was removed from Arabic Typesetting.

John Savard · August 2021

Peter Constable said:

For example, 'BCR ' = "Bible Cree", or 'MOR ' = "Moroccan": IMO these are unusable (except as private-use tags in a closed environment) because there's no documentation of what they are supposed to mean. So, no font developer knows when these tags might be appropriate in font data (being consistent with what other font developers are doing and what users expect), and no app developer would know when to apply them.

While the absence of documentation is unfortunate, Morocco is a real place, and in there people speak the Arabic language, and write it, perhaps with certain unique conventions of their own - so, if one makes one's font follow those conventions, as far as one can research them, when the language is described as Moroccan, that should work.

The Cree syllabary was originally developed by missionaries intent on presenting the Christian religion to the Cree Indians, and so presumably BCR indicates the conventions of early printed materials by these missionaries are to be followed.

After various font makers do research, and present fonts embodying it, eventually perhaps a standard with documentation will arise, which fonts can be revised to follow. If the tags are never used, though, no one will see a need to improve the situation. Even a de facto standard is better than none.

Peter Constable · August 2021

John Savard said:

... Morocco is a real place, and in there people speak the Arabic language, and write it, perhaps with certain unique conventions of their own...
The Cree syllabary was originally developed by missionaries intent on presenting the Christian religion to the Cree Indians, and so presumably BCR indicates the conventions of early printed materials by these missionaries are to be followed.
After various font makers do research, and present fonts embodying it, eventually perhaps a standard with documentation will arise... Even a de facto standard is better than none.

But there is no de facto standard. Presumably some conventions for Cree related to missionary work was assumed by whoever registered that tag, but who knows now what was intended. A font developer could research and arrive as some conclusions, but other font developers might arrive at different conclusions. Unless it's documented and that documentation is conventionally accepted as The Intent, no one can use it with confidence of interoperability.

As for Moroccan, of course Morocco is a real place where Arabic is spoken and written. But also Tamazight is spoken and written in Tifinagh script, along with Tachelhit and other Berber languages; and there are multiple typographic conventions for how Tifinagh script is written. Which is intended by 'MOR '? Nobody knows.

Nick Shinn · August 2021

Thomas said:

It is relatively simple for Latin, Greek and Cyrillic.

However, Uzbek, Kazakh and Turkmen are in a state of flux.
There are a number of Cyrillic characters for those languages that I don’t think there’s much point in providing in a normal retail font.
Better to include the Bulgarian variants.

John Hudson · August 2021

But there is no de facto standard. Presumably some conventions for Cree related to missionary work was assumed by whoever registered that tag, but who knows now what was intended. A font developer could research and arrive as some conclusions, but other font developers might arrive at different conclusions. Unless it's documented and that documentation is conventionally accepted as The Intent, no one can use it with confidence of interoperability.

But there is no documented de facto standard for most of the langsys tags that have been registered; indeed, most of those langsys tags will probably never be used in any fonts because there is not, in fact, any difference in glyph shape or behaviour that needs to be distinguished from the dflt script. Most of the registered langsys tags exist as placeholders against the day when some font maker decides they want to implement some novel behaviour for langsys XXX, which may never happen.

There is a relatively tiny number of langsys tags that have conventional implementation in fonts, mostly to resolve issues around Unicode’s encoding of characters with special behaviours as in ARA and TRK, or semi-deprecated unified encodings as in ROM/MOL, or regional preferences as in BGR, MKD, SRB, MAR.

Script level glyph distinctions are more common, i.e. variant shapes or behaviours associated with the locl feature within the dflt processing of a script, e.g. the variant forms of U+0304 combining macron and U+0306 combining breve that I just added to a Kannada font for use in prosodic notation.

Where do I find which glyphs are required for a given language?

Comments

Categories