Cyrillics I really need to bother with
Ray Larabie
While there's a time and a place for a fully decked-out Cyrillic Unicode range, I'm trying to come up with some better choices for where to draw the line.
I've noticed a lot of fonts have a limited Cyrillic set that goes from 0400 to 045F. 0460 to 0489 are historical glyphs so I probably won't bother with those. But I don't know much about the 048A to 04F9 range. I know the 0490 Ґ and 0491 ґ are used in Ukrainian so I'll start including those in my Cyrillic set. Are there any other characters in that range which I should definitely include or not bother with?
Comments
-
I always found @Thomas Phinney’s research for Adobe quite helpful.
4 -
Here's an updated image with the glyphs mentioned in Thomas' article. In case my goal wasn't clear: I'm trying to decide what a slightly more ambitious Cyrillic set could look like. Ending at 045F, as many fonts do, seems like a waste since, with just a few more glyphs, so many more languages could be covered. But I don't want to waste my time adding historical glyphs. I also want to avoid supporting languages that are almost extinct or transitioning to Latin. It's cruel but I can't support every language with every font.
0 -
Not that long ago, I had occasion to attempt to sort out something similar. I wound up parsing through the data in this Wikipedia entry: https://en.wikipedia.org/wiki/List_of_Cyrillic_letters
And cross-checking with data here: http://www.eki.ee/letter/
1 -
I've been researching this same thing. The page Kent references on Wiki is an excellent one, and another Wiki page I found quite useful, especially the comparison chart at the very bottom, is here:
https://en.wikipedia.org/wiki/Cyrillic_alphabets
There is also the Bulgarian design differences issue which involves some of the lowercase.
1 -
A couple of years ago, I took my research mentioned previously, plus some up-to-date input from the good folks at Adobe, and created FontLab .enc files for Adobe Cyrillic 1, 2 and 3 (107, 155, and 251 glyphs, respectively). https://github.com/tphinney/font-tools
-
FWIW, here is what I accumulated, formatted as a Python dict. I make no warranty that this is either entirely complete or completely accurate. It is assembled from multiple Internet sources, as described above.
cyrlLangChars = {
# Slavic Languages
'Russian': u'АБВГДЕЁЖЗИЙКЛМНОПРСТУФХЦЧШЩЪЫЬЭЮЯабвгдеёжзийклмнопрстуфхцчшщъыьэюя',
'Belarusian': u'АБВГДЕЁЖЗІЙКЛМНОПРСТУЎФХЦЧШЫЬЭЮЯабвгдеёжзійклмнопрстуўфхцчшыьэюя',
'Ukrainian': u'АБВГҐДЕЄЖЗИІЇЙКЛМНОПРСТУФХЦЧШЩЬЮЯабвгґдеєжзиіїйклмнопрстуфхцчшщьюя',
'Rusyn': u'АБВГҐДЕЁЄЖЗИІЇЙКЛМНОПРСТУФХЦЧШЩЪЫЬЮЯабвгґдеёєжзиіїйклмнопрстуфхцчшщъыьюя',
'Serbian': u'АБВГДЂЕЖЗИЈКЛЉМНЊОПРСТЋУФХЦЧЏШабвгдђежзијклљмнњопрстћуфхцчџш',
'Bulgarian': u'АБВГДЕЖЗИЍЙКЛМНОПРСТУФХЦЧШЩЪЬЮЯабвгдежзиѝйклмнопрстуфхцчшщъьюя', # Ѝѝ -- for disambiguation of feminine possessive pronoun
'Montenegrin': u'АБВГДЂЕЖЗИЈКЛЉМНЊОПРСТЋУФХЦЧЏШ́абвгдђежзијклљмнњопрстћуфхцчџш',
'Macedonian': u'АБВГЃДЕЀЖЗЅИЍЈКЛЉМНЊОПРСТЌУФХЦЧЏШабвгѓдеѐжзѕиѝјклљмнњопрстќуфхцчџш', # ЀЍѐѝ -- for disambiguation
# Other Indo-European/Romance Languages
'Moldovan': u'АБВГДЕЖӁЗИЙКЛМНОПРСТУФХЦЧШЫЬЭЮЯабвгдежӂзийклмнопрстуфхцчшыьэюя',
# Iranian Languages
'Kurdish': u'АБВГДЕӘЖЗИЙКЛМНОÖПРСТУФХҺЧШЩЬЭԚԜабвгдеәжзийклмноöпрстуфхһчшщьэԛԝ',
'Ossetian': u'АӔБВГДЕЁЖЗИЙКЛМНОПРСТУФХЦЧШЩЪЫЬЭЮЯаӕбвгдеёжзийклмнопрстуфхцчшщъыьэюя',
'Tajik': u'АБВГҒДЕЁЖЗИӢЙКҚЛМНОПРСТУӮФХҲЦЧҶШЩЪЫЬЭЮЯабвгғдеёжзиӣйкқлмнопрстуӯфхҳцчҷшщъыьэюя', # ЦЩЫЬцщыь -- loanwords only
# Uralic Languages
'Kildin Sami': u'АӒБВГДЕЁЖЗИЙҊЈКЛӅМӍНӉӇОПРҎСТУФХҺЦЧШЩЪЫЬҌЭӬЮЯаӓбвгдеёжзийҋјклӆмӎнӊӈопрҏстуфхһцчшщъыьҍэӭюя', # cmb macron may be required
'Komi-Permyak': u'АБВГДЕЁЖЗИІЙКЛМНОӦПРСТУФХЦЧШЩЪЫЬЭЮЯабвгдеёжзиійклмноӧпрстуфхцчшщъыьэюя',
'Meadow Mari': u'АБВГДЕЁЖЗИЙКЛМНҤОӦПРСТУӰФХЦЧШЩЪЫЬЭЮЯабвгдеёжзийклмнҥоӧпрстуӱфхцчшщъыьэюя',
'Hill Mari': u'АӒБВГДЕЁЖЗИЙКЛМНОӦПРСТУӰФХЦЧШЩЪЫӸЬЭЮЯаӓбвгдеёжзийклмноӧпрстуӱфхцчшщъыӹьэюя',
'Udmurt': u'АБВГДЕЁЖӜЗӞИӤЙКЛМНОӦПРСТУФХЦЧӴШЩЪЫЬЭЮЯабвгдеёжӝзӟиӥйклмноӧпрстуфхцчӵшщъыьэюя',
'Khanty': u'АӒӘӚБВГДЕЁЖЗИЙКӃЛМНӇОӦӨӪПРСТУӰФХЦЧШЩЪЫЬЭЮЯаӓәӛбвгдеёжзийкӄлмнӈоӧөӫпрстуӱфхцчшщъыьэюя',
'Nenets': u'АБВГДЕЁЖЗИЙКЛМНӇОПРСТУФХЦЧШЩЪЫЬЭЮЯабвгдеёжзийклмнӈопрстуфхцчшщъыьэюя',
# Caucasian Languages
'Abkhaz': u'АБВГӶҔДЕҼҾЖЗӠИКҚҞЛМНОҨПҦРСТҬУФХҲЦҴЧҶЏШЫЬабвгӷҕдеҽҿжзӡикқҟлмноҩпҧрстҭуфхҳцҵчҷџшыь',
'Kabardian': u'АБВГДЕЖЗИӀЙКЛМНОПРСТУФХЦЧШЩЪЫЬЮЯабвгдежзиӏйклмнопрстуфхцчшщъыьюя',
'Chechen': u'АБВГДЕЁЖЗИӀЙКЛМНОПРСТУФХЦЧШЪЫЬЭЮЯабвгдеёжзиӏйклмнопрстуфхцчшъыьэюя',
# Turkic Languages
'Azerbaijani': u'АӘБВГҒДЕЖЗИЙЈКҜЛМНОӨПРСТУҮФХҺЧҸШЫаәбвгғдежзийјкҝлмноөпрстуүфхһчҹшы',
'Turkmen': u'АӘБВГДЕЁЖҖЗИЙКЛМНҢОӨПРСТУҮФХЦЧШЩЪЫЬЭЮЯаәбвгдеёжҗзийклмнңоөпрстуүфхцчшщъыьэюя',
'Kazakh': u'АӘБВГҒДЕЁЖЗИІЙКҚЛМНҢОӨПРСТУҮҰФХҺЦЧШЩЪЫЬЭЮЯаәбвгғдеёжзиійкқлмнңоөпрстуүұфхһцчшщъыьэюя',
'Kyrgyz': u'АБВГДЕЁЖЗИЙКЛМНҢОӨПРСТУҮФХЦЧШЩЪЫЬЭЮЯабвгдеёжзийклмнңоөпрстуүфхцчшщъыьэюя', # ВФЦЩЪЬвфцщъь -- loanwords only
'Karachay': u'АБВГДЕЁЖЗИЙКЛМНОПРСТУЎФХЦЧШЩЪЫЬЭЮЯабвгдеёжзийклмнопрстуўфхцчшщъыьэюя',
'Bashkir': u'АӘБВГҒДЕЁЖЗҘИЙКҠЛМНҢОӨПРСҪТУҮФХҺЦЧШЩЪЫЬЭЮЯаәбвгғдеёжзҙийкҡлмнңоөпрсҫтуүфхһцчшщъыьэюя',
'Tatar': u'АӘБВГДЕЁЖҖЗИЙКЛМНҢОӨПРСТУҮФХҺЦЧШЩЪЫЬЭЮЯаәбвгдеёжҗзийклмнңоөпрстуүфхһцчшщъыьэюя',
'Altai': u'АБВГДЕЁЖЗИЙЈКЛМНҤОӦПРСТУӰФХЦЧШЩЪЫЬЭЮЯабвгдеёжзийјклмнҥоӧпрстуӱфхцчшщъыьэюя',
'Khakass': u'АБВГҒДЕЁЖЗИІЙКЛМНҢОӦПРСТУӰФХЦЧӋШЩЪЫЬЭЮЯабвгғдеёжзиійклмнңоӧпрстуӱфхцчӌшщъыьэюя',
'Sakha': u'АБВГҔДЕЁЖЗИЙКЛМНҤОӨПРСТУҮФХҺЦЧШЩЪЫЬЭЮЯабвгҕдеёжзийклмнҥоөпрстуүфхһцчшщъыьэюя',
'Tuvin': u'АБВГДЕЁЖЗИЙКЛМНҢОӨПРСТУҮФХЦЧШЩЪЫЬЭЮЯабвгдеёжзийклмнңоөпрстуүфхцчшщъыьэюя',
'Uzbek': u'АБВГҒДЕЁЖЗИЙКҚЛМНОПРСТУЎФХҲЦЧШЩЪЬЭЮЯабвгғдеёжзийкқлмнопрстуўфхҳцчшщъьэюя',
'Uyghur': u'АӘБВГҒДЕЖҖЗИЙКҚЛМНҢОӨПРСТУҮФХҺЧШЮЯаәбвгғдежҗзийкқлмнңоөпрстуүфхһчшюя',
'Chuvash': u'АӐБВГДЕЁӖЖЗИЙКЛМНОПРСҪТУӲФХЦЧШЩЪЫЬЭЮЯаӑбвгдеёӗжзийклмнопрсҫтуӳфхцчшщъыьэюя',
'Evenki': u'АБВГДЕЁЖЗИЙКЛМНӇОПРСТУФХЦЧШЩЪЫЬЭЮЯабвгдеёжзийклмнӈопрстуфхцчшщъыьэюя',
# Mongolian Languages
'Buryat': u'АБВГДЕЁЖЗИЙКЛМНОӨПРСТУҮФХҺЦЧШЩЪЫЬЭЮЯабвгдеёжзийклмноөпрстуүфхһцчшщъыьэюя', # КФЩЪкфщъ -- loanwords only
'Khalkha': u'АБВГДЕЁЖЗИЙКЛМНОӨПРСТУҮФХЦЧШЩЪЫЬЭЮЯабвгдеёжзийклмноөпрстуүфхцчшщъыьэюя',
'Kalmyk': u'АӘБВГДЕЁЖҖЗИЙКЛМНҢОӨПРСТУҮФХҺЦЧШЩЫЬЭЮЯаәбвгдеёжҗзийклмнңоөпрстуүфхһцчшщыьэюя',
# Sino-Tibetan Languages
'Dungan': u'АӘБВГДЕЁЖҖЗИЙКЛМНҢОПРСТУЎҮФХЦЧШЩЪЫЬЭЮЯаәбвгдеёжҗзийклмнңопрстуўүфхцчшщъыьэюя',
}
Note that for many of these, Cyrillic may no longer be the primary script. There is plenty of politics wrapped up in languages/dialects and scripts. Also note that there are alternate spellings or even alternate names for many of these. I’ve done what I can to try to identify the preferred name/spelling.
Those who are adept at Python can convert this to Unicode codepoints, as desired.
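For anyone who wants to try that, here is a minimal sketch of the conversion. It uses only two entries copied from the dict above to stay short, and the helper name `to_codepoints` is my own invention, not part of the posted data.

```python
# Two entries copied from the dict above; the full dict works the same way.
cyrlLangChars = {
    'Ukrainian': 'АБВГҐДЕЄЖЗИІЇЙКЛМНОПРСТУФХЦЧШЩЬЮЯабвгґдеєжзиіїйклмнопрстуфхцчшщьюя',
    'Serbian': 'АБВГДЂЕЖЗИЈКЛЉМНЊОПРСТЋУФХЦЧЏШабвгдђежзијклљмнњопрстћуфхцчџш',
}

def to_codepoints(chars):
    """Return the sorted, de-duplicated code points of a string as 4-digit hex."""
    return ['%04X' % cp for cp in sorted({ord(c) for c in chars})]

# Print one line per language: its name followed by the code points it needs.
for lang, chars in sorted(cyrlLangChars.items()):
    print(lang, ' '.join(to_codepoints(chars)))
```

The output can be pasted into whatever encoding or glyph-list format a font editor expects, with a `U+` or `uni` prefix added as needed.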
23 -
Thank you, Kent.
0 -
Thanks, everybody. The glyphs I've highlighted seem to cover all of these except for these two languages.
Abkhaz (7000 speakers/22 glyphs) ӶҔҼҾӠҞҨҦҬҲҴӷҕҽҿӡҟҩҧҭҳҵ
Kildin Sami (600 speakers/18 glyphs) ЙҊӅӍӉӇҎҌӬйҋӆӎӊӈҏҍӭ
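This kind of comparison can be scripted rather than done by eye. The sketch below is only an illustration: it takes the 0400 to 045F range from the original question as the base set (not the highlighted set in the image), uses three entries copied from Kent's dict, and ranks each extra character by how many of those languages need it. The names `BASIC`, `extras` and `rank_extras` are made up for the example.

```python
from collections import Counter

# The limited set many fonts stop at, per the original question: U+0400..U+045F.
BASIC = {chr(cp) for cp in range(0x0400, 0x0460)}

# A few entries copied from the dict above; substitute the full dict as needed.
sample = {
    'Tatar': 'АӘБВГДЕЁЖҖЗИЙКЛМНҢОӨПРСТУҮФХҺЦЧШЩЪЫЬЭЮЯаәбвгдеёжҗзийклмнңоөпрстуүфхһцчшщъыьэюя',
    'Kazakh': 'АӘБВГҒДЕЁЖЗИІЙКҚЛМНҢОӨПРСТУҮҰФХҺЦЧШЩЪЫЬЭЮЯаәбвгғдеёжзиійкқлмнңоөпрстуүұфхһцчшщъыьэюя',
    'Nenets': 'АБВГДЕЁЖЗИЙКЛМНӇОПРСТУФХЦЧШЩЪЫЬЭЮЯабвгдеёжзийклмнӈопрстуфхцчшщъыьэюя',
}

def extras(chars, base=BASIC):
    """Characters a language needs that the base set lacks, in code point order."""
    return sorted(set(chars) - base)

def rank_extras(lang_chars, base=BASIC):
    """Count, for each extra character, how many languages require it."""
    counts = Counter()
    for chars in lang_chars.values():
        counts.update(extras(chars, base))
    return counts.most_common()

for ch, n in rank_extras(sample):
    print('U+%04X %s needed by %d language(s)' % (ord(ch), ch, n))
```

Characters near the top of the ranking are the cheapest way to add languages. Speaker counts are not in the dict, so weighting by them would need a separate data source.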
Frode — Thanks for directing me to that PDF. Kildin Sami was definitely one of the more difficult to find any consistent information for. It was hard to ascertain whether the lengthened vowels were truly alphabetic. And most of those do not have codepoints anyway.
I see that in the version of my data I posted above, the Ӣӣ and Ӯӯ went missing. You’ll note my hedged comment at the end of that line about the combining macron. As far as I can tell the ӢӣӮӯ were encoded in Unicode for Tajik; but the rest of the “macroned” vowels for Kildin Sami were never included.
Hard to tell what’s the best approach in a situation like this, where only a few of a pattern of related characters are encoded and the rest must be achieved with combining accents.
But yeah, I suppose the precomposed ӢӣӮӯ should have been left in the listing. They are necessary but not sufficient. (Which may be true of some of the other langs as well; such is the lot of so-called “minority” languages.)
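The "necessary but not sufficient" situation is easy to see with Unicode normalization. The sketch below uses Python's standard `unicodedata` module; the list of vowels is an arbitrary handful chosen for illustration, not the actual Kildin Sami inventory, and `macron_forms` is a made-up helper name.

```python
import unicodedata

MACRON = '\u0304'  # COMBINING MACRON

def macron_forms(vowels):
    """Map each base vowel to its NFC form when followed by a combining macron."""
    return {v: unicodedata.normalize('NFC', v + MACRON) for v in vowels}

# Uppercase Cyrillic vowels written as escapes: А Е И О У Э
vowels = '\u0410\u0415\u0418\u041e\u0423\u042d'

for base, composed in macron_forms(vowels).items():
    # A result of length 1 means Unicode has a precomposed letter for the pair.
    kind = 'precomposed' if len(composed) == 1 else 'base + combining macron'
    points = ' '.join('U+%04X' % ord(c) for c in composed)
    print('U+%04X: %s -> %s' % (ord(base), kind, points))
```

Only И and У compose to single code points (Ӣ and Ӯ); the others stay as two code points, so a font needs the combining macron and sensible mark positioning to render them.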
0 -
Is there a reason to add a palochka glyph when, in practice, people just type 1 or an uppercase I because there's no keyboard layout for it?
-
@Joon Park I came here to ask exactly the same question. If you click View all you can see how palochkas are represented in different fonts.
http://www.fileformat.info/info/unicode/char/04c0/fontsupport.htm
http://www.fileformat.info/info/unicode/char/04cF/fontsupport.htm
Ray Larabie said: @Joon Park I came here to ask exactly the same question. If you click View all you can see how palochkas are represented in different fonts.
http://www.fileformat.info/info/unicode/char/04c0/fontsupport.htm
http://www.fileformat.info/info/unicode/char/04cF/fontsupport.htm
It's a bit confusing though. I see the lowercase being represented as an uppercase I as well as a lowercase L. Which is correct, both in practice and semantically?
Although the wiki doesn't clearly state what the deal is with the lowercase form, there's some helpful background on the talk page. https://en.wikipedia.org/wiki/Talk:Palochka
I couldn't even find a web page that displays a lowercase palochka in a sentence. There are probably very few people on the planet who can answer this.
0 -
FWIW, the note about palochka in the Unicode Standard says this:
Palochka. U+04C0 “I” CYRILLIC LETTER PALOCHKA is used in Cyrillic orthographies for a number of Caucasian languages, such as Adyghe, Avar, Chechen, and Kabardian. The name palochka itself is based on the Russian word for “stick,” referring to the shape of the letter. The glyph for palochka is usually indistinguishable from an uppercase Latin “I” or U+0406 “I” CYRILLIC CAPITAL LETTER BYELORUSSIAN-UKRAINIAN I; however, in some serifed fonts it may be displayed without serifs to make it more visually distinct.
In use, palochka typically modifies the reading of a preceding letter, indicating that it is an ejective. The palochka is generally caseless and should retain its form even in lowercased Cyrillic text. However, there is some evidence of distinctive lowercase forms; for those instances, U+04CF CYRILLIC SMALL LETTER PALOCHKA may be used.
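One practical wrinkle: even though the letter is described as generally caseless, the Unicode case mappings do pair the two code points, so software that lowercases text will produce U+04CF. A small sketch with Python's standard library shows this; what shape the font should give U+04CF remains a design judgment that this does not settle.

```python
import unicodedata

PALOCHKA = '\u04c0'        # CYRILLIC LETTER PALOCHKA
SMALL_PALOCHKA = '\u04cf'  # CYRILLIC SMALL LETTER PALOCHKA

# Show the official name and general category (Lu/Ll) of each code point.
for ch in (PALOCHKA, SMALL_PALOCHKA):
    print('U+%04X %s (%s)' % (ord(ch), unicodedata.name(ch), unicodedata.category(ch)))

# Python follows the Unicode case mappings, so case conversion swaps the code point.
print(PALOCHKA.lower() == SMALL_PALOCHKA)
print(SMALL_PALOCHKA.upper() == PALOCHKA)
```

So a font that includes U+04C0 but not U+04CF may show a missing glyph or a fallback font whenever an application lowercases Chechen or Kabardian text.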
When you dig that deep into the “which for what” issue, maybe you’re quicker arriving at a comprehensive solution when you just do all the characters and can thus be sure no one misses anything?
4 -
you’re quicker arriving at a comprehensive solution
This is also true of Joon Park's post about Greek glyphs. I am pretty sure you can draw the glyph quicker than you can unearth its reasonable degree of usage.
does the lowercase palochka resemble a lowercase L?
I do not think the matter is settled at all. The very notion of a lowercase palochka seems to be a matter of debate.
We would need native speakers to weigh in on current preferences.
1 -
Since many of us here will spend the rest of our lives filling in these Unicode ranges over and over again, it pays to spend more time working out what to include and what not to include. If you're working on a long-term project where you intend to fill everything in, go for it. Filling in Unicode ranges without knowing how, or if, glyphs are ever going to be used wastes time in the long term and bloats fonts. New type designers, unsure of which glyphs they should include, may look to existing fonts for guidance, which perpetuates wrong or junk glyphs. Except in the case of comprehensive language fonts, we all decide which characters we're going to support and which we're not. Knowing which forms are historical or deprecated is important in making those decisions.
In the case of textured/distressed fonts, there's a breaking limit to the number of non-composite glyphs that can be included. Knowing which glyphs are deprecated, historical or rarely used can contribute to more language coverage and more stable fonts.
For example, the long s: ſ. Even beginners know that this is a historical glyph. It's certainly appropriate in a comprehensive cover-everything font, an old-timey Caslon or a distressed pirate-themed font, but in an ultramodern design, it's clutter. There's a deprecated character right in the middle of Latin Extended-A that we all know about: ŉ (U+0149). There it is at the top of table 2. http://unicode.org/review/pr-122.html
A lot of new fonts still include this glyph, not because of its usefulness, but because it just happens to be in the middle of Latin Extended-A.
Feel free to fill in everything if you want but perhaps we shouldn't use Unicode tables to decide where to stop.
0344 ( ̈́ ) COMBINING GREEK DIALYTIKA TONOS *
037E ( ; ) GREEK QUESTION MARK *
0387 ( · ) GREEK ANO TELEIA *
20A4 ( ₤ ) LIRA SIGN
2126 ( Ω ) OHM SIGN *
Interesting proposal list. So is their use discouraged because they are rarely used in practice, or because duplicate glyphs already exist?
Edit: NM, had to look up Normalization Form C.
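For anyone else who has to look it up, the effect of Normalization Form C on the five characters listed above can be checked directly. This is a small sketch using Python's standard `unicodedata` module; `nfc_result` is a made-up helper name, and I am only assuming the asterisks in the list mark the characters that NFC replaces.

```python
import unicodedata

# Code points from the list above, with the names given there.
listed = {
    0x0344: 'COMBINING GREEK DIALYTIKA TONOS',
    0x037E: 'GREEK QUESTION MARK',
    0x0387: 'GREEK ANO TELEIA',
    0x20A4: 'LIRA SIGN',
    0x2126: 'OHM SIGN',
}

def nfc_result(cp):
    """Return the code points that remain after NFC-normalizing one character."""
    return [ord(c) for c in unicodedata.normalize('NFC', chr(cp))]

for cp, name in sorted(listed.items()):
    result = nfc_result(cp)
    status = 'unchanged' if result == [cp] else 'replaced'
    print('U+%04X %s: %s -> %s' % (cp, name, status, ' '.join('U+%04X' % r for r in result)))
```

Text that passes through NFC never contains the replaced code points, so in normalized text the glyphs assigned to them are never reached. The lira sign is the one entry here that NFC leaves alone.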
Does anyone have reliable documentation on the localised forms of Cyrillic?
1 -
@Wei Huang I have started documentation of this kind. Look at the Local Fonts (here). And here are the local forms by language: Bulgarian Cyrillic Feature Locl, Serbian Cyrillic Feature Locl, Macedonian Cyrillic Feature Locl, Bashkir Cyrillic Feature Locl, Chuvash Cyrillic Feature Locl.
1 -
The document pointed out by Frode Bo Heiland reminds me of a political issue I stumbled across in listening to music on YouTube. It appears that the Sami are recognized as an indigenous people by Sweden, but Finland refuses to give them the same recognition.
Given that Finnish, Estonian, and Sami are all very closely related languages, I would think that the Finns do have an excuse for this apparently retrograde political position. Recognizing the Sami as an indigenous people would suggest that the Finns themselves are an indigenous people - as opposed to a civilized people every bit the equal of Swedes, Frenchmen, and so on.
But they could be recognized as a separate nationality, like Basques or Welshmen, without the Finns having to categorize themselves or anyone else as primitive savages. This would make everyone happy.
0 -
So are the combining diacritical marks truly necessary for modern Cyrillic text?
I recently came across the Lettersoup page on Bulgarian Cyrillic localized forms. Under the "Marks in the Cyrillic Script" section, they say...
"Some characters in the Cyrillic script need marks but they do not have a Unicode and actually do not exist as precomposed characters."
Is this true?
Yes, for full coverage of Cyrillic you minimally need a combining dieresis, a combining breve (the Cyrillic-looking kind), a combining macron, and a combining acute. I'm not sure about grave. Double-acute, double-grave, and inverted breve *might* be needed for Serbian poetics but not for actual day-to-day use (they're used in the Latin alphabet for this purpose, but I'm not 100% sure if they are used in Cyrillic).
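A rough way to see which marks the already-encoded letters imply is to decompose them. The sketch below uses Python's standard `unicodedata` module on a handful of precomposed letters that appear in Kent's dict (Ё Й Ў Ѓ Ќ Ѐ Ѝ Ӣ Ӯ Ӳ, written as escapes); `marks_used` is a made-up helper name. It says nothing about unencoded combinations, which is exactly where the combining marks themselves become necessary.

```python
import unicodedata

def marks_used(chars):
    """Map each combining mark to the precomposed letters in chars that decompose to it."""
    found = {}
    for ch in sorted(set(chars)):
        decomposed = unicodedata.normalize('NFD', ch)
        # Everything after the base letter is a candidate combining mark.
        for mark in decomposed[1:]:
            if unicodedata.combining(mark):
                found.setdefault(mark, []).append(ch)
    return found

# Ё Й Ў Ѓ Ќ Ѐ Ѝ Ӣ Ӯ Ӳ
letters = '\u0401\u0419\u040e\u0403\u040c\u0400\u040d\u04e2\u04ee\u04f2'

for mark, users in sorted(marks_used(letters).items()):
    print('U+%04X %s: %s' % (ord(mark), unicodedata.name(mark), ' '.join(users)))
```

Running the same function over the whole dict gives a starting list of marks to draw, to be extended by hand for stress marks and unencoded letters.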
1 -
The Cyrillic set I use has all the usual Latin combining accents with the exception of circumflex, dot, ogonek and ring. I devised the set by referring to a variety of sources and as much as I could, determined they were valid. Most are unencoded and likely of use primarily in loanwords and transliteration. The time spent including them is so minimal I can't find a valid reason to omit them.
0 -
Sorry. I forgot to mention combining caron, which is also used.
André
George Thomas said: The Cyrillic set I use has all the usual Latin combining accents with the exception of circumflex, dot, ogonek and ring. I devised the set by referring to a variety of sources and as much as I could, determined they were valid. Most are unencoded and likely of use primarily in loanwords and transliteration. The time spent including them is so minimal I can't find a valid reason to omit them.
The main reason I'm asking is that it will only add minimal time for someone who uses anchors, but I don't usually build accented glyphs with anchors, meaning it will add a bit of time to include any combining marks...
For reference: https://github.com/google/fonts/tree/master/tools/encodings
0 -
André G. Isaak said: Yes, for full coverage of Cyrillic you minimally need a combining dieresis, a combining breve (the Cyrillic-looking kind), a combining macron, and a combining acute. I'm not sure about grave. Double-acute, double-grave, and inverted breve *might* be needed for Serbian poetics but not for actual day-to-day use (they're used in the Latin alphabet for this purpose, but I'm not 100% sure if they are used in Cyrillic).
Hi André (or anyone else who wants to join in). A couple more Cyrillic combining accent questions if you don't mind. When you say "full coverage of Cyrillic", are you saying combining diacritics are necessary in day-to-day use, like an é in French or ñ in Spanish? Or are you saying they are needed more to cover every possible orthographic need in things like grammar books and dictionaries, for pronunciation and showing stress?
The reason I ask is that I've been trying to see how extensive the Cyrillic language support is from some of the larger foundries. While I know Hoefler isn't known for making Cyrillic fonts, Gotham contains no combining diacritics, yet mentions:
"A survey into linguistic, cultural, political, economic, and technological conditions in the region, along with a review of typography created by native speakers, led to H&Co’s Cyrillic-X character set, which is included standard in all Gotham packages. Consulting with H&Co on the project were two Cyrillists: Maxim Zhukov, former Typographic Coordinator to the United Nations, and Ilya Ruderman, creator of the Type & Typography program at the British Higher School of Art and Design in Moscow." https://www.typography.com/fonts/gotham/features/gotham-language-support
Commercial Type hired Ilya Ruderman as well for their Cyrillic extensions and none of their typefaces contain combining accents.
This is by no means trying to question your knowledge; it is more me just trying to sort out my confused and uninformed mind.