While there's a time and a place for a fully decked-out Cyrillic Unicode range, I'm trying to come up with some better choices for where to draw the line.
I've noticed a lot of fonts have a limited Cyrillic set that goes from 0400 to 045F. 0460 to 0489 are historical glyphs so I probably won't bother with those. But I don't know much about the 048A to 04F9 range. I know the 0490 Ґ and 0491 ґ are used in Ukrainian so I'll start including those in my Cyrillic set. Are there any other characters in that range which I should definitely include or not bother with?
Comments
And cross-checking with data here: http://www.eki.ee/letter/
https://en.wikipedia.org/wiki/Cyrillic_alphabets
There is also the Bulgarian design differences issue which involves some of the lowercase.
I make no warranty that this is either entirely complete or completely accurate. It is assembled from multiple Internet sources, as described above.
Note that for many of these, Cyrillic may no longer be the primary script. There is plenty of politics wrapped up in languages/dialects and scripts. Also note that there are alternate spellings or even alternate names for many of these. I’ve done what I can to try to identify the preferred name/spelling.
Those who are adept at Python can convert this to Unicode codepoints, as desired.
Abkhaz (7000 speakers/22 glyphs) ӶҔҼҾӠҞҨҦҬҲҴӷҕҽҿӡҟҩҧҭҳҵ
Kildin Sami (600 speakers/18 glyphs) ЙҊӅӍӉӇҎҌӬйҋӆӎӊӈҏҍӭ
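For anyone who wants the codepoints rather than the glyphs, here is a minimal Python sketch of that conversion. It only covers the two glyph strings quoted above, and the names `LANGUAGE_GLYPHS` and `to_codepoints` are made up for illustration.

```python
# Glyph strings copied from the two entries above; the dictionary name and
# the helper below are hypothetical, chosen only for this illustration.
LANGUAGE_GLYPHS = {
    "Abkhaz": "ӶҔҼҾӠҞҨҦҬҲҴӷҕҽҿӡҟҩҧҭҳҵ",
    "Kildin Sami": "ЙҊӅӍӉӇҎҌӬйҋӆӎӊӈҏҍӭ",
}


def to_codepoints(glyphs):
    """Return a 'U+XXXX' label for every character in the string."""
    return [f"U+{ord(ch):04X}" for ch in glyphs]


# Print one line per language: the name followed by its codepoints.
for language, glyphs in LANGUAGE_GLYPHS.items():
    print(language, " ".join(to_codepoints(glyphs)))
```

The output can be pasted into whatever glyph list or subsetting tool you use. Extending it to the other languages is just a matter of adding more entries to the dictionary.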
I see that in the version of my data I posted above, the Ӣӣ and Ӯӯ went missing. You’ll note my hedged comment at the end of that line about the combining macron. As far as I can tell the ӢӣӮӯ were encoded in Unicode for Tajik; but the rest of the “macroned” vowels for Kildin Sami were never included.
Hard to tell what’s the best approach in a situation like this, where only a few of a pattern of related characters are encoded and the rest must be achieved with combining accents.
But yeah, I suppose the precomposed ӢӣӮӯ should have been left in the listing. They are necessary but not sufficient. (Which may be true of some of the other langs as well; such is the lot of so-called “minority” languages.)
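A quick way to see which macroned vowels exist as precomposed characters is to let Python's standard `unicodedata` module compose them. This is only a sketch: the helper name `compose_with_macron` is made up, and а and е are just sample bases, not a complete list of the vowels Kildin Sami needs.

```python
import unicodedata

COMBINING_MACRON = "\u0304"  # U+0304 COMBINING MACRON


def compose_with_macron(base):
    """Append a combining macron and let NFC compose it if it can."""
    return unicodedata.normalize("NFC", base + COMBINING_MACRON)


# Cyrillic и, у, а, е written as escapes to avoid look-alike Latin letters.
for base in "\u0438\u0443\u0430\u0435":
    composed = compose_with_macron(base)
    status = "precomposed" if len(composed) == 1 else "needs combining macron"
    print(base, [f"U+{ord(c):04X}" for c in composed], status)
```

The и and у collapse to single codepoints (U+04E3 and U+04EF), while the other bases stay as base plus U+0304. So a font aiming at those languages needs a usable combining macron as well as the precomposed ӢӣӮӯ.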
http://www.fileformat.info/info/unicode/char/04c0/fontsupport.htm
http://www.fileformat.info/info/unicode/char/04cF/fontsupport.htm
It's a bit confusing though: I see the lowercase being represented both as an uppercase I and as a lowercase L. Which is correct, both in practice and semantically?
I couldn't even find a web page that displays a lowercase palochka in a sentence. There are probably very few people on the planet who can answer this.
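On the semantic side at least, the Unicode character data can be queried directly. The sketch below uses Python's standard `unicodedata` module. It says nothing about what the glyph should look like or what native writers actually type, only how the two codepoints are named and case-mapped.

```python
import unicodedata

PALOCHKA_UPPER = "\u04C0"  # the codepoint from the first link above
PALOCHKA_LOWER = "\u04CF"  # the codepoint from the second link above


def describe(ch):
    """Return the codepoint and its Unicode character name."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"


print(describe(PALOCHKA_UPPER))
print(describe(PALOCHKA_LOWER))

# The two codepoints are a case pair in the Unicode data.
print(PALOCHKA_UPPER.lower() == PALOCHKA_LOWER)

# The lowercase palochka is a character of its own, distinct from Latin l and i.
print(PALOCHKA_LOWER == "l", PALOCHKA_LOWER == "i")
```

So semantically 04CF is simply the lowercase partner of 04C0, whatever shape a given font draws for it.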
We would need native speakers to weigh in on current preferences.
In the case of textured/distressed fonts, there's a breaking limit to the number of non-composite glyphs that can be included. Knowing which glyphs are deprecated, historical or rarely used can contribute to more language coverage and more stable fonts.
For example, the long s: ſ. Even beginners know that this is a historical glyph. It's certainly appropriate in a comprehensive cover-everything font, an old-timey Caslon or a distressed pirate-themed font, but in an ultramodern design it's clutter. There's a deprecated character right in the middle of Latin Extended A that we all know about: ŉ. There it is at the top of table 2. http://unicode.org/review/pr-122.html
A lot of new fonts still include this glyph, not because of its usefulness, but because it just happens to be in the middle of Latin Extended A.
Feel free to fill in everything if you want but perhaps we shouldn't use Unicode tables to decide where to stop.
037E ( ; ) GREEK QUESTION MARK *
0387 ( · ) GREEK ANO TELEIA *
20A4 ( ₤ ) LIRA SIGN
2126 ( Ω ) OHM SIGN *
Interesting proposal list. So is their use discouraged because they're rarely used in practice, or because duplicate glyphs already exist?
Edit: NM, had to look up Normalization Form C.
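For anyone else who had to look it up, here is a minimal Python sketch that runs the four characters listed above through Normalization Form C with the standard `unicodedata` module. That the asterisks mark the characters NFC replaces is my assumption, not something stated in the list. The helper name `nfc_replacement` is made up.

```python
import unicodedata

# The four codepoints from the list above, in the same order.
LISTED = ["\u037E", "\u0387", "\u20A4", "\u2126"]


def nfc_replacement(ch):
    """Return the character NFC substitutes for ch, or None if unchanged."""
    normalized = unicodedata.normalize("NFC", ch)
    return None if normalized == ch else normalized


for ch in LISTED:
    replacement = nfc_replacement(ch)
    if replacement is None:
        print(f"U+{ord(ch):04X} {unicodedata.name(ch)}: unchanged by NFC")
    else:
        print(f"U+{ord(ch):04X} {unicodedata.name(ch)}: becomes "
              f"U+{ord(replacement):04X} {unicodedata.name(replacement)}")
```

Text that has been normalized to NFC will never contain the replaced codepoints, which is one argument for treating them as duplicates of the characters they turn into.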
Given that Finnish, Estonian, and Sami are all very closely related languages, I would think that the Finns do have an excuse for this apparently retrograde political position. Recognizing the Sami as an indigenous people would suggest that the Finns themselves are an indigenous people - as opposed to a civilized people every bit the equal of Swedes, Frenchmen, and so on.
But they could be recognized as a separate nationality, like Basques or Welshmen, without the Finns having to categorize themselves or anyone else as primitive savages. This would make everyone happy.
I recently came across the Lettersoup page on Bulgarian Cyrillic localized forms. Under the "Marks in the Cyrillic Script" section, they say...
"Some characters in the Cyrillic script need marks but they do not have a Unicode and actually do not exist as precomposed characters."
Is this true?
André
The main reason I'm asking is because it will only add minimal time for someone who uses anchors, but I don't usually build accented glyphs with anchors, meaning it will add a bit of time to include any combining marks...
Hi André (or anyone else who wants to join in). A couple more Cyrillic combining accent questions if you don't mind...