Cyrillics I really need to bother with

Ray Larabie · December 2015

While there's a time and a place for a fully decked-out Cyrillic Unicode range, I'm trying to come up with some better choices for where to draw the line.

I've noticed a lot of fonts have a limited Cyrillic set that goes from 0400 to 045F. 0460 to 0489 are historical glyphs so I probably won't bother with those. But I don't know much about the 048A to 04F9 range. I know the 0490 Ґ and 0491 ґ are used in Ukrainian so I'll start including those in my Cyrillic set. Are there any other characters in that range which I should definitely include or not bother with?

Image: https://us.v-cdn.net/5019405/uploads/editor/ub/px41miadpzm9.png

Christoph Koeberlin · December 2015

I always found @Thomas Phinney’s research for Adobe quite helpful.

Ray Larabie · December 2015

Here's an updated image with the glyphs mentioned in Thomas' article. In case my goal wasn't clear: I'm trying to decide what a slightly more ambitious Cyrillic set could look like. Ending at 045F as many fonts do seems like a waste since, with just a few more glyphs, so many more languages could be covered. But I don't want to waste my time adding historical glyphs. I also want to avoid supporting languages that are almost extinct or transitioning to Latin. It's cruel but I can't support every language with every font.

Image: https://us.v-cdn.net/5019405/uploads/editor/x7/6rzslz5tkjdu.png

Kent Lew · December 2015

Not that long ago, I had occasion to attempt to sort out something similar. I wound up parsing through the data in this Wikipedia entry: https://en.wikipedia.org/wiki/List_of_Cyrillic_letters

And cross-checking with data here: http://www.eki.ee/letter/

George Thomas · December 2015

I've been researching this same thing. The page Kent references on Wiki is an excellent one, and another Wiki page I found quite useful, especially the comparison chart at the very bottom, is here:
https://en.wikipedia.org/wiki/Cyrillic_alphabets

There is also the Bulgarian design differences issue which involves some of the lowercase.

Thomas Phinney · December 2015

A couple of years ago, I took my research mentioned previously, plus some up-to-date input from the good folks at Adobe, and created FontLab .enc files for Adobe Cyrillic 1, 2 and 3 (107, 155, and 251 glyphs, respectively). https://github.com/tphinney/font-tools

Kent Lew · December 2015

FWIW, here is what I accumulated, formatted as a Python dict:

cyrlLangChars = {
# Slavic Languages
'Russian': u'АБВГДЕЁЖЗИЙКЛМНОПРСТУФХЦЧШЩЪЫЬЭЮЯабвгдеёжзийклмнопрстуфхцчшщъыьэюя',
'Belarusian': u'АБВГДЕЁЖЗІЙКЛМНОПРСТУЎФХЦЧШЫЬЭЮЯабвгдеёжзійклмнопрстуўфхцчшыьэюя',
'Ukrainian': u'АБВГҐДЕЄЖЗИІЇЙКЛМНОПРСТУФХЦЧШЩЬЮЯабвгґдеєжзиіїйклмнопрстуфхцчшщьюя',
'Rusyn': u'АБВГҐДЕЁЄЖЗИІЇЙКЛМНОПРСТУФХЦЧШЩЪЫЬЮЯабвгґдеёєжзиіїйклмнопрстуфхцчшщъыьюя',
'Serbian': u'АБВГДЂЕЖЗИЈКЛЉМНЊОПРСТЋУФХЦЧЏШабвгдђежзијклљмнњопрстћуфхцчџш',
'Bulgarian': u'АБВГДЕЖЗИЍЙКЛМНОПРСТУФХЦЧШЩЪЬЮЯабвгдежзиѝйклмнопрстуфхцчшщъьюя',  # Ѝѝ -- for disambiguation of feminine possessive pronoun
'Montenegrin': u'АБВГДЂЕЖЗИЈКЛЉМНЊОПРСТЋУФХЦЧЏШ́абвгдђежзијклљмнњопрстћуфхцчџш',
'Macedonian': u'АБВГЃДЕЀЖЗЅИЍЈКЛЉМНЊОПРСТЌУФХЦЧЏШабвгѓдеѐжзѕиѝјклљмнњопрстќуфхцчџш', # ЀЍѐѝ -- for disambiguation
# Other Indo-European/Romance Languages
'Moldovan': u'АБВГДЕЖӁЗИЙКЛМНОПРСТУФХЦЧШЫЬЭЮЯабвгдежӂзийклмнопрстуфхцчшыьэюя',
# Iranian Languages
'Kurdish': u'АБВГДЕӘЖЗИЙКЛМНОÖПРСТУФХҺЧШЩЬЭԚԜабвгдеәжзийклмноöпрстуфхһчшщьэԛԝ',
'Ossetian': u'АӔБВГДЕЁЖЗИЙКЛМНОПРСТУФХЦЧШЩЪЫЬЭЮЯаӕбвгдеёжзийклмнопрстуфхцчшщъыьэюя',
'Tajik': u'АБВГҒДЕЁЖЗИӢЙКҚЛМНОПРСТУӮФХҲЦЧҶШЩЪЫЬЭЮЯабвгғдеёжзиӣйкқлмнопрстуӯфхҳцчҷшщъыьэюя', # ЦЩЫЬцщыь -- loanwords only
# Uralic Languages
'Kildin Sami': u'АӒБВГДЕЁЖЗИЙҊЈКЛӅМӍНӉӇОПРҎСТУФХҺЦЧШЩЪЫЬҌЭӬЮЯаӓбвгдеёжзийҋјклӆмӎнӊӈопрҏстуфхһцчшщъыьҍэӭюя', # cmb macron may be required
'Komi-Permyak': u'АБВГДЕЁЖЗИІЙКЛМНОӦПРСТУФХЦЧШЩЪЫЬЭЮЯабвгдеёжзиійклмноӧпрстуфхцчшщъыьэюя',
'Meadow Mari': u'АБВГДЕЁЖЗИЙКЛМНҤОӦПРСТУӰФХЦЧШЩЪЫЬЭЮЯабвгдеёжзийклмнҥоӧпрстуӱфхцчшщъыьэюя',
'Hill Mari': u'АӒБВГДЕЁЖЗИЙКЛМНОӦПРСТУӰФХЦЧШЩЪЫӸЬЭЮЯаӓбвгдеёжзийклмноӧпрстуӱфхцчшщъыӹьэюя',
'Udmurt': u'АБВГДЕЁЖӜЗӞИӤЙКЛМНОӦПРСТУФХЦЧӴШЩЪЫЬЭЮЯабвгдеёжӝзӟиӥйклмноӧпрстуфхцчӵшщъыьэюя',
'Khanty': u'АӒӘӚБВГДЕЁЖЗИЙКӃЛМНӇОӦӨӪПРСТУӰФХЦЧШЩЪЫЬЭЮЯаӓәӛбвгдеёжзийкӄлмнӈоӧөӫпрстуӱфхцчшщъыьэюя',
'Nenets': u'АБВГДЕЁЖЗИЙКЛМНӇОПРСТУФХЦЧШЩЪЫЬЭЮЯабвгдеёжзийклмнӈопрстуфхцчшщъыьэюя',
# Caucasian Languages
'Abkhaz': u'АБВГӶҔДЕҼҾЖЗӠИКҚҞЛМНОҨПҦРСТҬУФХҲЦҴЧҶЏШЫЬабвгӷҕдеҽҿжзӡикқҟлмноҩпҧрстҭуфхҳцҵчҷџшыь',
'Kabardian': u'АБВГДЕЖЗИӀЙКЛМНОПРСТУФХЦЧШЩЪЫЬЮЯабвгдежзиӏйклмнопрстуфхцчшщъыьюя',
'Chechen': u'АБВГДЕЁЖЗИӀЙКЛМНОПРСТУФХЦЧШЪЫЬЭЮЯабвгдеёжзиӏйклмнопрстуфхцчшъыьэюя',
# Turkic Languages
'Azerbaijani': u'АӘБВГҒДЕЖЗИЙЈКҜЛМНОӨПРСТУҮФХҺЧҸШЫаәбвгғдежзийјкҝлмноөпрстуүфхһчҹшы',
'Turkmen': u'АӘБВГДЕЁЖҖЗИЙКЛМНҢОӨПРСТУҮФХЦЧШЩЪЫЬЭЮЯаәбвгдеёжҗзийклмнңоөпрстуүфхцчшщъыьэюя',
'Kazakh': u'АӘБВГҒДЕЁЖЗИІЙКҚЛМНҢОӨПРСТУҮҰФХҺЦЧШЩЪЫЬЭЮЯаәбвгғдеёжзиійкқлмнңоөпрстуүұфхһцчшщъыьэюя',
'Kyrgyz': u'АБВГДЕЁЖЗИЙКЛМНҢОӨПРСТУҮФХЦЧШЩЪЫЬЭЮЯабвгдеёжзийклмнңоөпрстуүфхцчшщъыьэюя', # ВФЦЩЪЬвфцщъь -- loanwords only
'Karachay': u'АБВГДЕЁЖЗИЙКЛМНОПРСТУЎФХЦЧШЩЪЫЬЭЮЯабвгдеёжзийклмнопрстуўфхцчшщъыьэюя',
'Bashkir': u'АӘБВГҒДЕЁЖЗҘИЙКҠЛМНҢОӨПРСҪТУҮФХҺЦЧШЩЪЫЬЭЮЯаәбвгғдеёжзҙийкҡлмнңоөпрсҫтуүфхһцчшщъыьэюя',
'Tatar': u'АӘБВГДЕЁЖҖЗИЙКЛМНҢОӨПРСТУҮФХҺЦЧШЩЪЫЬЭЮЯаәбвгдеёжҗзийклмнңоөпрстуүфхһцчшщъыьэюя',
'Altai': u'АБВГДЕЁЖЗИЙЈКЛМНҤОӦПРСТУӰФХЦЧШЩЪЫЬЭЮЯабвгдеёжзийјклмнҥоӧпрстуӱфхцчшщъыьэюя',
'Khakass': u'АБВГҒДЕЁЖЗИІЙКЛМНҢОӦПРСТУӰФХЦЧӋШЩЪЫЬЭЮЯабвгғдеёжзиійклмнңоӧпрстуӱфхцчӌшщъыьэюя',
'Sakha': u'АБВГҔДЕЁЖЗИЙКЛМНҤОӨПРСТУҮФХҺЦЧШЩЪЫЬЭЮЯабвгҕдеёжзийклмнҥоөпрстуүфхһцчшщъыьэюя',
'Tuvin': u'АБВГДЕЁЖЗИЙКЛМНҢОӨПРСТУҮФХЦЧШЩЪЫЬЭЮЯабвгдеёжзийклмнңоөпрстуүфхцчшщъыьэюя',
'Uzbek': u'АБВГҒДЕЁЖЗИЙКҚЛМНОПРСТУЎФХҲЦЧШЩЪЬЭЮЯабвгғдеёжзийкқлмнопрстуўфхҳцчшщъьэюя',
'Uyghur': u'АӘБВГҒДЕЖҖЗИЙКҚЛМНҢОӨПРСТУҮФХҺЧШЮЯаәбвгғдежҗзийкқлмнңоөпрстуүфхһчшюя',
'Chuvash': u'АӐБВГДЕЁӖЖЗИЙКЛМНОПРСҪТУӲФХЦЧШЩЪЫЬЭЮЯаӑбвгдеёӗжзийклмнопрсҫтуӳфхцчшщъыьэюя',
'Evenki': u'АБВГДЕЁЖЗИЙКЛМНӇОПРСТУФХЦЧШЩЪЫЬЭЮЯабвгдеёжзийклмнӈопрстуфхцчшщъыьэюя',
# Mongolian Languages
'Buryat': u'АБВГДЕЁЖЗИЙКЛМНОӨПРСТУҮФХҺЦЧШЩЪЫЬЭЮЯабвгдеёжзийклмноөпрстуүфхһцчшщъыьэюя', # КФЩЪкфщъ -- loanwords only
'Khalkha': u'АБВГДЕЁЖЗИЙКЛМНОӨПРСТУҮФХЦЧШЩЪЫЬЭЮЯабвгдеёжзийклмноөпрстуүфхцчшщъыьэюя',
'Kalmyk': u'АӘБВГДЕЁЖҖЗИЙКЛМНҢОӨПРСТУҮФХҺЦЧШЩЫЬЭЮЯаәбвгдеёжҗзийклмнңоөпрстуүфхһцчшщыьэюя',
# Sino-Tibetan Languages
'Dungan': u'АӘБВГДЕЁЖҖЗИЙКЛМНҢОПРСТУЎҮФХЦЧШЩЪЫЬЭЮЯаәбвгдеёжҗзийклмнңопрстуўүфхцчшщъыьэюя',
}

I make no warranty that this is either entirely complete or completely accurate. It is assembled from multiple Internet sources, as described above.

Note that for many of these, Cyrillic may no longer be the primary script. There is plenty of politics wrapped up in languages/dialects and scripts. Also note that there are alternate spellings or even alternate names for many of these. I’ve done what I can to try to identify the preferred name/spelling.

Those who are adept at Python can convert this to Unicode codepoints, as desired.
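
One way to do that conversion is sketched below. It assumes Python 3 and uses a single entry copied from the dict above as a stand-in for the full table; the helper name to_codepoints is made up for illustration.

```python
# One entry copied from the dict above, standing in for the full table.
sample = {
    'Ukrainian': 'АБВГҐДЕЄЖЗИІЇЙКЛМНОПРСТУФХЦЧШЩЬЮЯабвгґдеєжзиіїйклмнопрстуфхцчшщьюя',
}

def to_codepoints(chars):
    """Return sorted 'U+XXXX' labels for the distinct characters in chars."""
    return ['U+%04X' % cp for cp in sorted({ord(c) for c in chars})]

# Build a language -> list of code point labels mapping.
codepoints = {lang: to_codepoints(chars) for lang, chars in sample.items()}

for lang, cps in codepoints.items():
    print(lang, ' '.join(cps))
```

Sorting by code point rather than alphabet order makes it easier to compare the result against a Unicode chart or a font's cmap listing.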

George Thomas · December 2015

Thank you, Kent.

Ray Larabie · December 2015

Thanks, everybody. The glyphs I've highlighted seem to cover all of these except for these two languages.

Abkhaz (7000 speakers/22 glyphs) ӶҔҼҾӠҞҨҦҬҲҴӷҕҽҿӡҟҩҧҭҳҵ
Kildin Sami (600 speakers/18 glyphs) ЙҊӅӍӉӇҎҌӬйҋӆӎӊӈҏҍӭ
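
A check like this can be scripted. The sketch below is only an illustration: the supported set is a hypothetical one (0400 to 045F plus Ґ and ґ), not the highlighted set from the image, and the two sample entries are copied from Kent's dict.

```python
def missing_chars(lang_chars, supported):
    """Map each language to the characters it needs that are not in supported."""
    report = {}
    for lang, chars in lang_chars.items():
        missing = sorted(set(chars) - supported)
        if missing:
            report[lang] = ''.join(missing)
    return report

# Hypothetical glyph set: U+0400..U+045F plus U+0490 and U+0491.
supported = {chr(cp) for cp in range(0x0400, 0x0460)} | {'\u0490', '\u0491'}

# Two entries copied from the dict posted above.
sample = {
    'Ukrainian': 'АБВГҐДЕЄЖЗИІЇЙКЛМНОПРСТУФХЦЧШЩЬЮЯабвгґдеєжзиіїйклмнопрстуфхцчшщьюя',
    'Nenets': 'АБВГДЕЁЖЗИЙКЛМНӇОПРСТУФХЦЧШЩЪЫЬЭЮЯабвгдеёжзийклмнӈопрстуфхцчшщъыьэюя',
}

for lang, chars in missing_chars(sample, supported).items():
    print(lang, len(chars), chars)
```

Languages that are fully covered are left out of the report, so an empty result means the set covers every alphabet in the table.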

Kent Lew · December 2015

Frode — Thanks for directing me to that PDF. Kildin Sami was definitely one of the more difficult to find any consistent information for. It was hard to ascertain whether the lengthened vowels were truly alphabetic. And most of those do not have codepoints anyway.

I see that in the version of my data I posted above, the Ӣӣ and Ӯӯ went missing. You’ll note my hedged comment at the end of that line about the combining macron. As far as I can tell the ӢӣӮӯ were encoded in Unicode for Tajik; but the rest of the “macroned” vowels for Kildin Sami were never included.

Hard to tell what’s the best approach in a situation like this, where only a few of a pattern of related characters are encoded and the rest must be achieved with combining accents.

But yeah, I suppose the precomposed ӢӣӮӯ should have been left in the listing. They are necessary but not sufficient. (Which may be true of some of the other langs as well; such is the lot of so-called “minority” languages.)
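
The "necessary but not sufficient" situation can be seen with Python's standard unicodedata module. This is a small sketch, not a survey of the Kildin Sami alphabet: it composes four arbitrary Cyrillic vowels with U+0304 COMBINING MACRON and reports which ones collapse into a single precomposed code point under NFC.

```python
import unicodedata

MACRON = '\u0304'  # COMBINING MACRON

def compose(base):
    """Return the NFC form of base followed by a combining macron."""
    return unicodedata.normalize('NFC', base + MACRON)

# Cyrillic small letters i, u, a, e (U+0438, U+0443, U+0430, U+044D).
for base in '\u0438\u0443\u0430\u044d':
    composed = compose(base)
    kind = 'precomposed' if len(composed) == 1 else 'needs the combining mark'
    print(' '.join('U+%04X' % ord(c) for c in composed), kind)
```

Only the first two letters compose (to U+04E3 and U+04EF); the others stay as a base letter plus mark, so a font would need the combining macron and mark positioning to show them.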

Joon Park · December 2015

Is there a reason to add a palochka glyph when, in practice, people just type 1 or an uppercase I because there's no keyboard layout for it?

Ray Larabie · December 2015

@Joon Park I came here to ask exactly the same question. If you click View all you can see how palochkas are represented in different fonts.

http://www.fileformat.info/info/unicode/char/04c0/fontsupport.htm
http://www.fileformat.info/info/unicode/char/04cF/fontsupport.htm

Joon Park · December 2015

Great to know many fonts still include them regardless.

It's a bit confusing though: I see the lowercase represented as an uppercase I as well as a lowercase L. Which is correct, both in practice and semantically?

Ray Larabie · December 2015

Although the wiki doesn't clearly state what the deal is with the lowercase form, there's some helpful background on the talk page. https://en.wikipedia.org/wiki/Talk:Palochka

I couldn't even find a web page that displays a lowercase palochka in a sentence. There are probably very few people on the planet who can answer this.

Kent Lew · December 2015

FWIW, the note about palochka in the Unicode Standard says this:

Palochka. U+04C0 “I” CYRILLIC LETTER PALOCHKA is used in Cyrillic orthographies for a number of Caucasian languages, such as Adyghe, Avar, Chechen, and Kabardian. The name palochka itself is based on the Russian word for “stick,” referring to the shape of the letter. The glyph for palochka is usually indistinguishable from an uppercase Latin “I” or U+0406 “I” CYRILLIC CAPITAL LETTER BYELORUSSIAN-UKRAINIAN I; however, in some serifed fonts it may be displayed without serifs to make it more visually distinct.

In use, palochka typically modifies the reading of a preceding letter, indicating that it is an ejective. The palochka is generally caseless and should retain its form even in lowercased Cyrillic text. However, there is some evidence of distinctive lowercase forms; for those instances, U+04CF CYRILLIC SMALL LETTER PALOCHKA may be used.
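
Whatever readers prefer, software case mapping does pair the two code points. A minimal sketch using only Python's built-in string methods and unicodedata; the three-letter string is an arbitrary sequence chosen for illustration, not a real word.

```python
import unicodedata

PALOCHKA = '\u04c0'        # CYRILLIC LETTER PALOCHKA
SMALL_PALOCHKA = '\u04cf'  # CYRILLIC SMALL LETTER PALOCHKA

for ch in (PALOCHKA, SMALL_PALOCHKA):
    print('U+%04X' % ord(ch), unicodedata.name(ch))

# Arbitrary sequence: capital KA, palochka, capital A.
text = '\u041a\u04c0\u0410'
lowered = text.lower()
print(' '.join('U+%04X' % ord(c) for c in lowered))
```

So text that has been lowercased programmatically will contain U+04CF, which is one practical argument for giving it a glyph even if its shape matches U+04C0.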

Joon Park · December 2015

Does the lowercase palochka resemble a lowercase L?

Andreas Stötzner · December 2015

When you dig that deep into the “which for what” issue: maybe you’re quicker arriving at a comprehensive solution when you just do all the characters and can thus be sure no one will miss anything?

Chris Lozos · December 2015

you’re quicker arriving at a comprehensive solution

This is also true of Joon Park's post about Greek glyphs. I am pretty sure you can draw the glyph quicker than you can unearth the reasonable degree of usage.

Kent Lew · December 2015

Does the lowercase palochka resemble a lowercase L?

I do not think the matter is settled at all. The very notion of a lowercase palochka seems to be a matter of debate.

We would need native speakers to weigh in on current preferences.

Ray Larabie · December 2015

Since many of us here will spend the rest of our lives filling in these Unicode ranges over and over again, it pays to spend more time working out what to include and what not to include. If you're working on a long-term project where you intend to fill everything in, go for it. Filling in Unicode ranges without knowing how/if glyphs are ever going to be used wastes time in the long term and bloats fonts. New type designers, unsure of which glyphs they should include, may look to existing fonts for guidance, which perpetuates wrong/junk glyphs. Except in the case of comprehensive language fonts, we all decide which characters we're going to support and which characters we're not going to support. Knowing which forms are historical or deprecated is important in making those decisions.

In the case of textured/distressed fonts, there's a breaking limit to the number of non-composite glyphs that can be included. Knowing which glyphs are deprecated, historical or rarely used can contribute to more language coverage and more stable fonts.

For example, the long s: ſ. Even beginners know that this is an historical glyph. It's certainly appropriate in a comprehensive cover-everything font, an old-timey Caslon or a distressed pirate-themed font, but in an ultramodern design, it's clutter. There's a deprecated character right in the middle of Latin Extended-A that we all know about: ŉ. There it is at the top of table 2. http://unicode.org/review/pr-122.html
A lot of new fonts still include this glyph, not because of its usefulness, but because it just happens to be in the middle of Latin Extended-A.

Feel free to fill in everything if you want but perhaps we shouldn't use Unicode tables to decide where to stop.

Joon Park · December 2015

0344 ( ̈́ ) COMBINING GREEK DIALYTIKA TONOS *
037E ( ; ) GREEK QUESTION MARK *
0387 ( · ) GREEK ANO TELEIA *
20A4 ( ₤ ) LIRA SIGN
2126 ( Ω ) OHM SIGN *

Interesting proposal list. So is their use discouraged because they are rarely used in practice, or because duplicate glyphs already exist?

Edit: NM, had to look up Normalization Form C.
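
For anyone else looking it up, here is a small sketch of what Normalization Form C does to the five characters listed above, using Python's standard unicodedata module. The helper name nfc_codepoints is made up for illustration.

```python
import unicodedata

# The five characters from the list above.
listed = ['\u0344', '\u037e', '\u0387', '\u20a4', '\u2126']

def nfc_codepoints(ch):
    """Return 'U+XXXX' labels for the NFC form of ch."""
    return ['U+%04X' % ord(c) for c in unicodedata.normalize('NFC', ch)]

for ch in listed:
    before = 'U+%04X' % ord(ch)
    after = nfc_codepoints(ch)
    status = 'unchanged' if after == [before] else 'replaced'
    print(before, unicodedata.name(ch), '->', ' '.join(after), status)
```

The four characters that NFC replaces are the ones carrying an asterisk in the list; text that passes through normalization never reaches the font as those code points.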

Wei Huang · May 2016

Does anyone have reliable documentation on the localised forms of Cyrillic?

Stefan Peev · July 2017

Wei Huang, I have started this kind of documentation. Look at the Local Fonts (here). And here are the local forms by language – Bulgarian Cyrillic Feature Locl, Serbian Cyrillic Feature Locl, Macedonian Cyrillic Feature Locl, Bashkir Cyrillic Feature Locl, Chuvash Cyrillic Feature Locl.

John Savard · July 2017

The document pointed out by Frode Bo Heiland reminds me of a political issue I stumbled across in listening to music on YouTube. It appears that the Sami are recognized as an indigenous people by Sweden, but Finland refuses to give them the same recognition.

Given that Finnish, Estonian, and Sami are all very closely related languages, I would think that the Finns do have an excuse for this apparently retrograde political position. Recognizing the Sami as an indigenous people would suggest that the Finns themselves are an indigenous people - as opposed to a civilized people every bit the equal of Swedes, Frenchmen, and so on.

But they could be recognized as a separate nationality, like Basques or Welshmen, without the Finns having to categorize themselves or anyone else as primitive savages. This would make everyone happy.

Josh_F · July 2017

So are the combining diacritical marks truly necessary for modern Cyrillic text?

I recently came across the Lettersoup page on Bulgarian Cyrillic localized forms and its "Marks in the Cyrillic Script" section. They say...

"Some characters in the Cyrillic script need marks but they do not have a Unicode and actually do not exist as precomposed characters."

Is this true?

André G. Isaak · July 2017

Yes, for full coverage of cyrillic you minimally need a combining dieresis, a combining breve (the cyrillic-looking kind), a combining macron, and a combining acute. I'm not sure about grave. Double-acute, double-grave, and inverted breve *might* be needed for serbian poetics but not for actual day to day use (they're used in the latin alphabet for this purpose, but I'm not 100% sure if they are used in cyrillic).

George Thomas · July 2017

The Cyrillic set I use has all the usual Latin combining accents with the exception of circumflex, dot, ogonek and ring. I devised the set by referring to a variety of sources and as much as I could, determined they were valid. Most are unencoded and likely of use primarily in loanwords and transliteration. The time spent including them is so minimal I can't find a valid reason to omit them.

André G. Isaak · July 2017

Sorry. I forgot to mention combining caron, which is also used.

André

Josh_F · July 2017

George Thomas said:

The Cyrillic set I use has all the usual Latin combining accents with the exception of circumflex, dot, ogonek and ring. I devised the set by referring to a variety of sources and as much as I could, determined they were valid. Most are unencoded and likely of use primarily in loanwords and transliteration. The time spent including them is so minimal I can't find a valid reason to omit them.

The main reason I'm asking is because it will only add minimal time for someone who uses anchors, but I don't usually build accented glyphs with anchors, meaning it will add a bit of time to include any combining marks...

PabloImpallari · July 2017

For reference: https://github.com/google/fonts/tree/master/tools/encodings

Josh_F · August 2017

André G. Isaak said:

Yes, for full coverage of cyrillic you minimally need a combining dieresis, a combining breve (the cyrillic-looking kind), a combining macron, and a combining acute. I'm not sure about grave. Double-acute, double-grave, and inverted breve *might* be needed for serbian poetics but not for actual day to day use (they're used in the latin alphabet for this purpose, but I'm not 100% sure if they are used in cyrillic).

Hi André (or anyone else who wants to join in). A couple more Cyrillic combining accent questions if you don't mind...

When you say "full coverage of cyrillic", are you saying combining diacritics are necessary in day to day use, like an é in French or ñ in Spanish?

Or do you mean covering every possible orthographic need, as in grammar books and dictionaries that mark pronunciation and stress?

The reason I ask is because I've been trying to see how extensive the Cyrillic language support is from some of the larger foundries, and while I know Hoefler isn't known for making Cyrillic fonts, Gotham contains no combining diacritics, yet mentions...

"A survey into linguistic, cultural, political, economic, and technological conditions in the region, along with a review of typography created by native speakers, led to H&Co’s Cyrillic-X character set, which is included standard in all Gotham packages. Consulting with H&Co on the project were two Cyrillists: Maxim Zhukov, former Typographic Coordinator to the United Nations, and Ilya Ruderman, creator of the Type & Typography program at the British Higher School of Art and Design in Moscow." https://www.typography.com/fonts/gotham/features/gotham-language-support

Commercial Type hired Ilya Ruderman as well for their Cyrillic extensions and none of their typefaces contain combining accents.

This is by no means trying to question your knowledge, it is more so me just trying to sort out my confused and uninformed mind.
