Latin Extended-B Languages?

Fernando Díaz · September 2015

Hey there,

I wanted to know which languages are part of the Latin Extended-B set. I can find the glyphs but not the languages that they support.

Are they very 'important'? Or should I focus on other languages like Cyrillic/Greek?

Up to now, no client has ever asked for Latin Extended-B; nevertheless, a lot of 'pro' typefaces seem to have this character set covered.

Thank you

John Hudson · September 2015

The Unicode Latin Extended-B block contains a mix of diacritic characters from different sources:

Non-European and historic
This section includes a lot of letters used in African languages, as well as an assortment of Zhuang Chinese tone letters, a couple of Vietnamese letters (see also the Latin Extended Additional block for Vietnamese tone vowel diacritics), and a small number of archaic letters from older regional European alphabets.

African letters for clicks
Self-explanatory. Used for the Khoisan languages of southern Africa.

Croatian digraphs
These are an historical oddity. They were inherited into Unicode from a Yugoslav 8-bit national standard, and were encoded to provide a one-to-one mapping from the Serbo-Croat Cyrillic alphabet to the Serbo-Croat Latin alphabet. This allowed Yugoslav documents to be easily presented in either orthography simply by changing the font. Croatian nationalism in the 1990s made much of differences between Serbian and Croatian, so these characters are presumably obsolete.

Pinyin diacritic combinations
For Mandarin Chinese romanisation.

Phonetic and historical letters
A few used in African languages, some Uralist phonetic transcription characters, a couple used in one or more Sami alphabets, and more regional historical letters such as wynn.

Additions for Slovenian and Croatian
These are specialist diacritics used in prosody (analysis of metrical and stress patterns in poetry); they are not used for everyday Slovenian and Croatian text.

Additions for Romanian
These disunify the Romanian comma-below letters from the corresponding Turkic cedilla diacritics with which they were earlier encoded. Important.

Miscellaneous Additions
A couple of regional historic letters, letters from native North American alphabets, and the rest phoneticist characters.

Additions for Livonian
Recently moribund Finnic language; object of study but last native speaker died in 2013.

Additions for Sinology
IPA extensions used in transcription of classical Chinese.

(more) Miscellaneous Additions
More native North American letters, case pair additions for IPA letters previously encoded with only lowercase, odds and ends, dotless j.

My take on this is that there are very few fonts that would need to support the whole block. Unless one is either setting out to support all of Unicode, à la Noto, or providing fonts for broad academic publishing, à la Brill, only a subset of this block is likely to be necessary. The four Romanian diacritics are the most important for a font targeting European languages, as they correct an earlier encoding issue. If your font is supporting Vietnamese, then obviously you'll need the horn letters.

Igor Freiberger · September 2015

Latin Extended-B is needed for

Languages: Romanian, Azeri, Vietnamese, Slovenian (Latin), Croatian (Latin), Sami, Khoisan, Zulu, a number of Native American languages from western Canada, and several West African languages that use the pan-African and pan-Nigerian alphabets. It also supports minority languages that use the pan-Turkic alphabet, mainly lesser-known languages of small communities inside Russia with roots linked to Latin script.

Transliterations: Pinyin, Serbian Cyrillic transliterated to Croatian Latin

Old languages and orthographies: Zhuang, Gothic, Scots, Old Norse, Old English, Old Saxon and also legacy orthographies of West African languages.

Phonetics: sparse additions to IPA, APA and UPA.

Of course, the relevance of this block needs to be evaluated against your audience and targets. But if you are aiming at a wider market, Cyrillic represents more potential licensees.

Fernando Díaz · September 2015

Thanks John and Igor, this really helps!

Denis Moyogo Jacquerye · September 2015

Yup. Don’t do the whole block just for the sake of doing the whole block, unless you’re aiming to cover all Latin characters that are in Unicode.

There are two things you need to consider when looking at Latin Extended-B.

As John says, this is a mix of characters from various sources. The way they are grouped is also problematic, in particular the 'Non-European and historic' group, which contains both characters used by millions and characters used only in historical documents or documents related to them.
As a client, it would make more sense to look for specific characters rather than the whole block.

The other thing is that some of the letters in Latin Extended-B have their uppercase or lowercase counterpart in a different Unicode block, and some letters are only used in orthographies that also use characters from other Unicode blocks: IPA Extensions, Latin Extended-C, Latin Extended-D, Combining Diacritical Marks.
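To make the point about split case pairs concrete, here is a minimal Python sketch that lists Latin Extended-B letters whose case partner sits outside the block. It relies only on the case mappings built into the standard library, so the exact result depends on the Unicode version of the Python build; the function name is illustrative and not taken from any tool mentioned in this thread.

```python
import unicodedata

# Latin Extended-B spans U+0180..U+024F.
BLOCK_START, BLOCK_END = 0x0180, 0x024F


def case_partners_outside_block():
    """Map each Latin Extended-B letter to a case partner outside the block."""
    outside = {}
    for cp in range(BLOCK_START, BLOCK_END + 1):
        ch = chr(cp)
        if ch.isupper():
            partner = ch.lower()
        elif ch.islower():
            partner = ch.upper()
        else:
            continue  # caseless and titlecase characters are skipped
        # Ignore letters without a simple one-to-one case mapping.
        if len(partner) != 1 or partner == ch:
            continue
        if not BLOCK_START <= ord(partner) <= BLOCK_END:
            outside[cp] = ord(partner)
    return outside


for cp, partner in sorted(case_partners_outside_block().items()):
    name = unicodedata.name(chr(cp), "<no name>")
    print(f"U+{cp:04X} {name} -> U+{partner:04X}")
```

For example, the capital B with hook (U+0181) has its lowercase form in IPA Extensions, so a font covering only Latin Extended-B would ship half of that pair.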

ivan louette · June 2018

Is there a list of the language tags corresponding to Latin Extended-A and B? Otherwise, what must I do if, for example, I want all my subcaps to work for any language?

André G. Isaak · June 2018

ivan louette said:

Is there a list of the language tags corresponding to Latin Extended-A and B? Otherwise, what must I do if, for example, I want all my subcaps to work for any language?

Latin Extended-A corresponds to Central European encodings. Pretty much all the other Latin Extended-N blocks don’t correspond to any well-defined set of languages. Rather, they are “dumping grounds” for Latin characters that didn’t fit into the first three blocks of Latin characters. So it makes far more sense to think first in terms of which languages and special needs (poetics, phonetics, historical uses, etc.) you want to support rather than thinking in terms of Unicode blocks.
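A small Python sketch of why thinking per language works better than thinking per block, using Romanian as the example. The letter list is an assumption made for illustration (the five accented letters in both cases), not an official requirement list, and the "font" is just a hypothetical set of code points standing in for a real cmap.

```python
# Block ranges as given in the Unicode code charts.
BLOCKS = [
    ("Basic Latin", 0x0000, 0x007F),
    ("Latin-1 Supplement", 0x0080, 0x00FF),
    ("Latin Extended-A", 0x0100, 0x017F),
    ("Latin Extended-B", 0x0180, 0x024F),
]

# Romanian letters beyond a-z (assumed list for this example):
# A breve, A circumflex, I circumflex, S comma below, T comma below, both cases.
ROMANIAN_EXTRA = "\u0102\u00C2\u00CE\u0218\u021A\u0103\u00E2\u00EE\u0219\u021B"


def block_of(ch):
    """Return the name of the block containing ch, or 'other'."""
    cp = ord(ch)
    for name, start, end in BLOCKS:
        if start <= cp <= end:
            return name
    return "other"


def missing_from(cmap_codepoints, required):
    """Return the required characters whose code points the font does not map."""
    return [ch for ch in required if ord(ch) not in cmap_codepoints]


for ch in ROMANIAN_EXTRA:
    print(f"U+{ord(ch):04X} {ch} {block_of(ch)}")

# Hypothetical font covering everything up to the end of Latin Extended-A.
font_cmap = set(range(0x0020, 0x0180))
print(missing_from(font_cmap, ROMANIAN_EXTRA))
```

The ten letters land in three different blocks, and a font that stops at Latin Extended-A is missing exactly the four comma-below letters.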

ivan louette · June 2018

Thanks! That's a good idea. But for Roman language, for example, it seems that some characters are lacking in Latin Extended-A and are included in Latin Extended-B. So I should probably check which set of characters is needed for each language I want to support.

ivan louette · June 2018

I added the scommaaccent and tcommaaccent (which are part of Extended-B) and now my small caps in Roman language work fine. Thanks again!

Mark Simonson · June 2018

Do you mean "Romanian"?

ivan louette · June 2018

@Mark Simonson Yes sorry

Vasil Stanev · June 2018

I use the OTM by URW++ to check which languages are covered by my font.

Paul Miller · June 2018

There is a database which tells you which characters are needed for which language here and here, the first one is better in my opinion.

ivan louette · June 2018

@Vasil Stanev & @Paul Miller thanks a lot for these resources ! I am a newbie in this area and that's so valuable !

John Hudson · June 2018

OTM and several other tools/resources use the Unicode CLDR database. It's a good starting point for mapping characters and languages, but should be used with caution if trying to determine what characters are needed for a language. A good example is the legacy digraph characters that CLDR identifies as Croatian: these were inherited into Unicode from Yugoslav 8-bit encodings that were designed to enable one-to-one transcription between Cyrillic and Latin orthographies for Serbo-Croatian. So in a sense, yes, these are Croatian characters in that they represent sequences of letters used in the Croatian Latin orthography to write phonemes represented by a single letter in Cyrillic, but these legacy characters are not actually needed for Croatian, don't appear on Croatian keyboards and so forth.

ivan louette · June 2018

Thanks again.

And a related question (which could even open another topic): is it possible to define several different language-specific kerning sets in the same font?

Paul Miller · June 2018

I have defined different kerning sets for different languages with Font Creator, I expect it is also possible with Fontographer.

John Hudson · June 2018

You can define different kerning sets for different languages using OpenType GPOS kerning. In basic terms, you create separate lookups and assign them to different language system tags. Note, however, that application of language-specific kerning is dependent on a) text language being correctly tagged, and b) software recognising the tags and knowing to apply the specific OT language system instead of the default.
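A rough sketch of what "separate lookups assigned to different language system tags" can look like in OpenType feature-file syntax, generated here from Python so the structure is explicit. The glyph names, kerning values, and the helper name are hypothetical; `ROM` is the OpenType language system tag for Romanian. Pairs written before any `script` statement act as the defaults, and the pairs after `language ROM;` are added on top of them for Romanian-tagged text.

```python
def build_kern_feature(default_pairs, per_language):
    """Return feature-file text with default and language-specific kerning.

    default_pairs: list of (left_glyph, right_glyph, value) tuples
    per_language:  dict mapping (script_tag, language_tag) to such a list
    """
    lines = ["languagesystem DFLT dflt;"]
    # Declare the default language of each script before its specific languages.
    for script in sorted({script for script, _ in per_language}):
        lines.append(f"languagesystem {script} dflt;")
    for script, lang in sorted(per_language):
        lines.append(f"languagesystem {script} {lang};")

    lines += ["", "feature kern {"]
    # Rules placed before any script statement apply to every language system.
    for left, right, value in default_pairs:
        lines.append(f"    pos {left} {right} {value};")
    # Rules after a language statement apply only to that language system.
    for (script, lang), pairs in sorted(per_language.items()):
        lines.append(f"    script {script};")
        lines.append(f"    language {lang};")
        for left, right, value in pairs:
            lines.append(f"        pos {left} {right} {value};")
    lines.append("} kern;")
    return "\n".join(lines) + "\n"


# Hypothetical glyph names and values, for illustration only.
fea = build_kern_feature(
    default_pairs=[("A", "V", -80)],
    per_language={("latn", "ROM"): [("T", "acircumflex", -60)]},
)
print(fea)
```

If a language needs a different value for a pair that is also in the defaults, simply adding it again would likely stack the two adjustments; in that case the defaults would probably have to be excluded for that language (`language ROM exclude_dflt;`) and restated.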

Mark Simonson · June 2018

...Fontographer?

ivan louette · June 2018

@John Hudson Thanks a lot ! That's exactly what I expected but I wasn't absolutely sure of that.

Thomas Phinney · June 2018

Fontographer certainly does not handle that. I think you could supplement/replace the usual kerning with manual coding in FontLab 5 or VI to get language-specific variations, if you wished, but you'd need to take care not to overwrite your special code with an auto-generated kern feature.

LeMo aka PatternMan aka Frank E Blokland · June 2018

I use the OTM by URW++ […]

For the record: DTL OTMaster (OTM) is a product of the Dutch Type Library (DTL). OTM is jointly developed with URW Type Foundry (formerly URW++). The programming is done at URW in Hamburg, Germany. DTL and URW have worked together since 1991.

Paul Miller · June 2018

Mark Simonson said:

...Fontographer?

... or Font Lab or Font Studio or whatever it's called. The other one that isn't Glyphs, you know the one I mean !!

ivan louette · June 2018

@Thomas Phinney Thanks, at the moment I work with FontForge and it can do it (and I am astonished about what it can do). I had Fontographer in the past and I liked it a lot. But there was a long gap (before OTF) where it was unusable by Windows users… and after that time I went to Linux. For some time I was also a (very newbie) Fontlab IV user, and I installed it also successfully on Linux with Wine, but I was already fairly comfortable with FF on Linux and its very useful cut and paste capabilities with Inkscape.

I tried but never used auto-generated kerning in any program even if I understand that it can be an effective first step.

Kent Lew · June 2018

I think what Thomas meant by the “auto-generated kern feature” comment was that most commercial tools compile the {kern} feature out of the source file’s kerning data at the time of generation, and if you’re manually writing a {kern} feature into your .fea file in order to implement some kind of language-specific kerning, you may need to take care to see that the compiler gives precedence to your manual {kern} feature and doesn’t replace it with whatever native kerning data you might have in the source file when compiling.

Denis Moyogo Jacquerye · June 2018

@John Hudson Where does the CLDR identify those legacy characters as Croatian? Maybe that used to be the case but has been fixed since? The default characters exemplar sets of Bosnian, Croatian and Latin Serbian are all [a b c č ć d {dž} đ e f g h i j k l {lj} m n {nj} o p r s š t u v z ž], and the auxiliary characters exemplar sets of Bosnian and Croatian are [q w x y] and that of Latin Serbian [å q w x y].

John Hudson · June 2018

I did see the digraphs recently reported as Croatian characters in tools that I believe use CLDR. It may be an issue of how those tools choose to interpret the {} inclusions.
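One way to see how the braces might be read: a minimal Python sketch that splits the exemplar string quoted above into elements, keeping each `{...}` group together as a multi-character sequence. It handles only the simple space-separated form shown in this thread (no ranges or escapes), and the function name is made up for this example.

```python
import re
import unicodedata


def parse_exemplars(exemplar_set):
    """Split a simple exemplar string into elements; {xy} stays one element."""
    body = unicodedata.normalize("NFC", exemplar_set).strip().strip("[]")
    tokens = re.findall(r"\{[^}]+\}|\S", body)
    return [t[1:-1] if t.startswith("{") else t for t in tokens]


CROATIAN = "[a b c č ć d {dž} đ e f g h i j k l {lj} m n {nj} o p r s š t u v z ž]"

elements = parse_exemplars(CROATIAN)
digraphs = [e for e in elements if len(e) > 1]
code_points = sorted({ord(ch) for e in elements for ch in e})

print(len(elements), digraphs)
# True would mean the set contains one of the legacy digraph characters.
print(any(0x01C4 <= cp <= 0x01CC for cp in code_points))
```

Read this way, the three digraphs are sequences of two ordinary letters, and none of the legacy characters in the U+01C4 to U+01CC range is involved. A tool that instead mapped each braced sequence to a single precomposed character would report those legacy characters.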

Michel Boyer · June 2018

What I understand from the Unicode collation chart for Croatian is that the characters between braces need to be taken as a whole in alphabetical ordering of Croatian words.

I seem to be able to select individual characters (I mean select L and J individually) in that chart. On the other hand, if I copy the first line, paste it in a text editor and dump the Unicode content of that file, I get these characters:

  01C4  LATIN CAPITAL LETTER DZ WITH CARON
  01C5  LATIN CAPITAL LETTER D WITH SMALL LETTER Z WITH CARON
  01C6  LATIN SMALL LETTER DZ WITH CARON
  01C7  LATIN CAPITAL LETTER LJ
  01C8  LATIN CAPITAL LETTER L WITH SMALL LETTER J
  01C9  LATIN SMALL LETTER LJ
  01CA  LATIN CAPITAL LETTER NJ
  01CB  LATIN CAPITAL LETTER N WITH SMALL LETTER J
  01CC  LATIN SMALL LETTER NJ

I don't quite understand what is going on. I just checked and those characters appear to be hidden. They are not those that I could select as individual characters (LJ, lj etc).

PS: If I select only the section between LJ and m, I get the following output from hidden text (with blanks removed)

  01C7  LATIN CAPITAL LETTER LJ
  006C  LATIN SMALL LETTER L
  0135  LATIN SMALL LETTER J WITH CIRCUMFLEX
  004C  LATIN CAPITAL LETTER L
  0135  LATIN SMALL LETTER J WITH CIRCUMFLEX
  004C  LATIN CAPITAL LETTER L
  0134  LATIN CAPITAL LETTER J WITH CIRCUMFLEX
  006C  LATIN SMALL LETTER L
  01F0  LATIN SMALL LETTER J WITH CARON
  004C  LATIN CAPITAL LETTER L
  01F0  LATIN SMALL LETTER J WITH CARON

Three hours later: On a better screen at home I can now see those "hidden" characters (and much better with a large font; my eyesight is no longer what it once was).
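For anyone who wants to reproduce this kind of dump, a minimal Python sketch using the standard `unicodedata` module; the function name is illustrative, and whitespace is skipped to match the "blanks removed" listing above.

```python
import unicodedata


def dump_code_points(text):
    """Return one 'XXXX  NAME' line per non-space character in text."""
    return [
        f"{ord(ch):04X}  {unicodedata.name(ch, '<no name>')}"
        for ch in text
        if not ch.isspace()
    ]


# U+01C7 is the single-character digraph; "lj" is two ordinary letters.
for line in dump_code_points("\u01C7 lj"):
    print(line)
```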

Michel Boyer · June 2018

The MySQL 8.0 documentation clearly says

Croatian collations are tailored for these Croatian letters: Č, Ć, Dž, Đ, Lj, Nj, Š, Ž.

Three of those letters are digraphs, i.e. composed of two Unicode characters (implying, so it seems, that we need to distinguish not only glyphs from characters, but also characters from letters!). There is no guarantee that the corresponding single-code-point digraph characters (which are only compatibility equivalents of these two-letter sequences) would be handled properly by MySQL when sorting on fields containing Croatian text and, for interoperability, I would not use the Unicode characters in the 01C4 to 01CC range in input files.
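If input files might already contain those characters, one possible cleanup is to expand them to the two-letter sequences before the text reaches the database. A minimal Python sketch, assuming that NFKC normalization of just these nine characters is acceptable for the data at hand; the function name is hypothetical.

```python
import unicodedata

# U+01C4..U+01CC inclusive: the DZ-caron, LJ and NJ digraph characters.
DIGRAPH_RANGE = range(0x01C4, 0x01CD)


def expand_digraphs(text):
    """Replace the nine legacy digraph characters with two-letter sequences."""
    return "".join(
        unicodedata.normalize("NFKC", ch) if ord(ch) in DIGRAPH_RANGE else ch
        for ch in text
    )


# U+01C8 (capital L with small j) followed by ordinary letters.
print(expand_digraphs("\u01C8ubljana"))
```

Normalizing only the nine digraph characters, rather than the whole string, leaves every other compatibility character in the text untouched.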

Thomas Phinney · June 2018

@Kent Lew Yes, exactly what I was trying to say. Thank you for putting it so clearly.

Igor Freiberger · June 2018

Hudson's alert about CLDR is quite important. The CLDR has many errors. Some take it as their main reference, like Underware, which makes their Latin Pro inconsistent. Take Portuguese, for example:

Image: https://us.v-cdn.net/5019405/uploads/editor/38/yri37uuwnmnw.jpg

The ò is not part of the Portuguese alphabet. It can be called auxiliary to support older (pre-1973) orthographies, but no more than that.
The ü is part of the alphabet in countries that have not yet adopted the 1990 reform, like Angola or Mozambique. Thus, listing it as auxiliary is wrong.
The other characters (marked) are definitely not used.
