Cyrillics I really need to bother with

While there's a time and a place for a fully decked-out Cyrillic Unicode range, I'm trying to come up with some better choices for where to draw the line. 

I've noticed a lot of fonts have a limited Cyrillic set that goes from 0400 to 045F.  0460 to 0489 are historical glyphs so I probably won't bother with those. But I don't know much about the 048A to 04F9 range. I know the 0490 Ґ and 0491 ґ are used in Ukrainian so I'll start including those in my Cyrillic set. Are there any other characters in that range which I should definitely include or not bother with?
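One way to get an overview of what actually sits in the 048A to 04F9 range is to dump the character names from the Unicode database. The following is a minimal sketch using only Python's standard `unicodedata` module; the helper name `describe_range` is made up for illustration, and the output depends on the Unicode version shipped with your Python build. It says nothing about which languages use a letter, only what is encoded.

```python
import unicodedata

def describe_range(first, last):
    """Return (code point, character, Unicode name) for each assigned character."""
    rows = []
    for cp in range(first, last + 1):
        ch = chr(cp)
        name = unicodedata.name(ch, None)  # None for unassigned code points
        if name is not None:
            rows.append((f"U+{cp:04X}", ch, name))
    return rows

# List everything between the historical letters and the end of the range in question.
for code, ch, name in describe_range(0x048A, 0x04F9):
    print(code, ch, name)
```

The names alone often hint at the target language group (descenders, hooks, tails), but they still need cross-checking against per-language alphabets.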



Comments

  • I always found @Thomas Phinney’s research for Adobe quite helpful.
  • Here's an updated image with the glyphs mentioned in Thomas' article. In case my goal wasn't clear: I'm trying to decide what a slightly more ambitious Cyrillic set could look like. Ending at 045F, as many fonts do, seems like a waste since, with just a few more glyphs, so many more languages could be covered. But I don't want to waste my time adding historical glyphs. I also want to avoid supporting languages that are almost extinct or transitioning to Latin. It's cruel, but I can't support every language with every font.


  • Kent Lew Posts: 638
    edited December 2015
    Not that long ago, I had occasion to attempt to sort out something similar. I wound up parsing through the data in this Wikipedia entry: https://en.wikipedia.org/wiki/List_of_Cyrillic_letters

    And cross-checking with data here: http://www.eki.ee/letter/
  • I've been researching this same thing. The page Kent references on Wiki is an excellent one, and another Wiki page I found quite useful, especially the comparison chart at the very bottom, is here:
    https://en.wikipedia.org/wiki/Cyrillic_alphabets

    There is also the Bulgarian design differences issue which involves some of the lowercase.
  • Thomas Phinney Posts: 802
    edited December 2015
    A couple of years ago, I took my research mentioned previously, plus some up-to-date input from the good folks at Adobe, and created FontLab .enc files for Adobe Cyrillic 1, 2 and 3. 107, 155, and 251 glyphs, respectively. https://github.com/tphinney/font-tools
  • Thank you, Kent.
  • Thanks, everybody. The glyphs I've highlighted seem to cover all of these except for these two languages.

    Abkhaz (7000 speakers/22 glyphs) ӶҔҼҾӠҞҨҦҬҲҴӷҕҽҿӡҟҩҧҭҳҵ
    Kildin Sami (600 speakers/18 glyphs) ЙҊӅӍӉӇҎҌӬйҋӆӎӊӈҏҍӭ
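A rough way to see how many code points a language adds beyond the common 0400 to 045F subset is a simple set difference. This is only a sketch: the glyph strings are copied from the post above (so they inherit any gaps in it), and `BASIC` and `beyond_basic` are names invented for this example.

```python
import unicodedata

BASIC = range(0x0400, 0x0460)  # the common U+0400..U+045F subset

# Glyph lists copied from the post above; not verified as complete alphabets.
EXTRA = {
    "Abkhaz": "ӶҔҼҾӠҞҨҦҬҲҴӷҕҽҿӡҟҩҧҭҳҵ",
    "Kildin Sami": "ЙҊӅӍӉӇҎҌӬйҋӆӎӊӈҏҍӭ",
}

def beyond_basic(text):
    """Return the sorted characters of `text` that fall outside the basic subset."""
    # Normalize first so that base + combining mark pairs become precomposed letters.
    text = unicodedata.normalize("NFC", text)
    return sorted({ch for ch in text if ord(ch) not in BASIC})

for lang, letters in EXTRA.items():
    extra = beyond_basic(letters)
    print(lang, len(extra), " ".join(f"U+{ord(c):04X}" for c in extra))
```

The same function could be pointed at any other language's letter list to estimate the cost of supporting it.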
  • edited December 2015
    Wonderful resource, Kent, but your Kildin Sami alphabet is not complete. See this document by Michael Rießler: http://www.siberian-studies.org/publications/PDF/sikriessler.pdf

  • Kent Lew Posts: 638
    Frode — Thanks for directing me to that PDF. Kildin Sami was definitely one of the more difficult to find any consistent information for. It was hard to ascertain whether the lengthened vowels were truly alphabetic. And most of those do not have codepoints anyway.

    I see that in the version of my data I posted above, the Ӣӣ and Ӯӯ went missing. You’ll note my hedged comment at the end of that line about the combining macron. As far as I can tell the ӢӣӮӯ were encoded in Unicode for Tajik; but the rest of the “macroned” vowels for Kildin Sami were never included.

    Hard to tell what’s the best approach in a situation like this, where only a few of a pattern of related characters are encoded and the rest must be achieved with combining accents.

    But yeah, I suppose the precomposed ӢӣӮӯ should have been left in the listing. They are necessary but not sufficient. (Which may be true of some of the other langs as well; such is the lot of so-called “minority” languages.)
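The "necessary but not sufficient" situation can be checked against the Unicode data itself. A small sketch, assuming only Python's standard `unicodedata` module (the helper names `compose` and `has_precomposed` are hypothetical): NFC normalization composes a base letter plus combining macron into a single code point only where a precomposed character exists.

```python
import unicodedata

MACRON = "\u0304"  # COMBINING MACRON

def compose(base, mark=MACRON):
    """Return the NFC form of base + mark."""
    return unicodedata.normalize("NFC", base + mark)

def has_precomposed(base, mark=MACRON):
    """True if base + mark collapses to one code point under NFC."""
    return len(compose(base, mark)) == 1

# Cyrillic а, е, и, о, у: an illustrative vowel sample, not the Kildin Sami inventory.
for base in "\u0430\u0435\u0438\u043e\u0443":
    composed = compose(base)
    codes = " ".join(f"U+{ord(c):04X}" for c in composed)
    print(base, "->", composed, codes, "precomposed" if has_precomposed(base) else "needs combining mark")
```

Letters that stay as two code points rely on the font's combining macron and its mark positioning, so the precomposed glyphs alone do not cover them.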
  • This is true for many languages, yes. Apache, Bislama, Guaraní, Low Saxon, Khmer Romanization, Chickasaw, Gooniyandi, Tłı̨chǫ, Cape Verdean Creole, Navajo, Samogitian, Aleut, Marshallese, Romany, Laz, Elfdalian, to name “a few”.


  • Is there a reason to add the palochka glyph when, in practice, people just type 1 or an uppercase I because there's no keyboard layout for it?
  • @Joon Park I came here to ask exactly the same question. If you click View all you can see how palochkas are represented in different fonts.

    http://www.fileformat.info/info/unicode/char/04c0/fontsupport.htm
    http://www.fileformat.info/info/unicode/char/04cF/fontsupport.htm
  • Great to know many fonts still include them regardless.

    It's a bit confusing though: I see the lowercase represented both as an uppercase I and as a lowercase L. Which is correct, both in practice and semantically?
  • Ray Larabie Posts: 647
    edited December 2015
    Although the wiki doesn't clearly state what the deal is with the lowercase form, there's some helpful background on the talk page. https://en.wikipedia.org/wiki/Talk:Palochka

    I couldn't even find a web page that displays a lowercase palochka in a sentence. There are probably very few people on the planet who can answer this.
  • Kent Lew Posts: 638
    FWIW, the note about palochka in the Unicode Standard says this:
    Palochka. U+04C0 “I” CYRILLIC LETTER PALOCHKA is used in Cyrillic orthographies for a number of Caucasian languages, such as Adyghe, Avar, Chechen, and Kabardian. The name palochka itself is based on the Russian word for “stick,” referring to the shape of the letter. The glyph for palochka is usually indistinguishable from an uppercase Latin “I” or U+0406 “I” CYRILLIC CAPITAL LETTER BYELORUSSIAN-UKRAINIAN I; however, in some serifed fonts it may be displayed without serifs to make it more visually distinct.

    In use, palochka typically modifies the reading of a preceding letter, indicating that it is an ejective. The palochka is generally caseless and should retain its form even in lowercased Cyrillic text. However, there is some evidence of distinctive lowercase forms; for those instances, U+04CF CYRILLIC SMALL LETTER PALOCHKA may be used.
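For what it's worth, the encoding side of this can be inspected directly. The sketch below uses only Python's standard library and prints the name, general category, and lowercase mapping for palochka and its lookalikes; `info` is a hypothetical helper name. It shows what software does with the character, not what native readers prefer.

```python
import unicodedata

# Latin I, Cyrillic Byelorussian-Ukrainian I, palochka, small palochka, Latin small l
LOOKALIKES = ["\u0049", "\u0406", "\u04C0", "\u04CF", "\u006C"]

def info(ch):
    """Return (code point, name, general category, code point of lowercase mapping)."""
    return (
        f"U+{ord(ch):04X}",
        unicodedata.name(ch),
        unicodedata.category(ch),
        f"U+{ord(ch.lower()):04X}",
    )

for ch in LOOKALIKES:
    print(*info(ch))
```

Note that although the Standard describes palochka as generally caseless, the character database does map U+04C0 to U+04CF when text is lowercased, so a font without U+04CF may show a missing glyph after a case conversion.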
  • Kent Lew said:
    “However, there is some evidence of distinctive lowercase forms; for those instances, U+04CF CYRILLIC SMALL LETTER PALOCHKA may be used.”

    Does the lowercase palochka resemble a lowercase L?
  • When you dig that deep into the “which for what” issue, maybe you’re quicker arriving at a comprehensive solution when you just do all the characters, and can thus be sure no one will miss anything?

  • “you’re quicker arriving at a comprehensive solution”
    This is also true of Joon Park's post about Greek glyphs. I am pretty sure you can draw the glyph quicker than you can unearth its reasonable degree of usage.
  • Kent Lew Posts: 638
    “Does the lowercase palochka resemble a lowercase L?”
    I do not think the matter is settled at all. The very notion of a lowercase palochka seems to be a matter of debate.

    We would need native speakers to weigh in on current preferences.

  • Since many of us here will spend the rest of our lives filling in these Unicode ranges over and over again, it pays to spend more time working out what to include and what not to include. If you're working on a long-term project where you intend to fill everything in, go for it. Filling in Unicode ranges without knowing how or whether glyphs are ever going to be used wastes time in the long term and bloats fonts. New type designers, unsure of which glyphs they should include, may look to existing fonts for guidance, which perpetuates wrong/junk glyphs. Except in the case of comprehensive language fonts, we all decide which characters we're going to support and which characters we're not going to support. Knowing which forms are historical or deprecated is important in making those decisions.

    In the case of textured/distressed fonts, there's a breaking limit to the number of non-composite glyphs that can be included. Knowing which glyphs are deprecated, historical or rarely used can contribute to more language coverage and more stable fonts.

    For example, the long s: ſ. Even beginners know that this is a historical glyph. It's certainly appropriate in a comprehensive cover-everything font, an old-timey Caslon or a distressed pirate-themed font, but in an ultramodern design it's clutter. There's a deprecated character right in the middle of Latin Extended-A that we all know about: ŉ (U+0149). There it is at the top of table 2. http://unicode.org/review/pr-122.html
    A lot of new fonts still include this glyph, not because of its usefulness, but because it just happens to be in the middle of Latin Extended-A.

    Feel free to fill in everything if you want but perhaps we shouldn't use Unicode tables to decide where to stop.
  • Joon Park Posts: 56
    edited December 2015
    0344 ( ̈́ ) COMBINING GREEK DIALYTIKA TONOS * 
    037E ( ; ) GREEK QUESTION MARK * 
    0387 ( · ) GREEK ANO TELEIA *
    20A4 ( ₤ ) LIRA SIGN 
    2126 ( Ω ) OHM SIGN * 

    Interesting proposal list. So is their use discouraged because they're rarely used in practice, or because duplicate glyphs already exist?

    Edit: NM, had to look up Normalization Form C.
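The Normalization Form C point can be illustrated with a short sketch using Python's standard `unicodedata` module (the helper name `nfc_codes` is made up here). Characters with a singleton canonical decomposition are replaced by another character whenever text is normalized to NFC, which is one reason their glyphs are rarely reached in normalized text.

```python
import unicodedata

# Code points taken from the list above.
CANDIDATES = ["\u0344", "\u037E", "\u0387", "\u20A4", "\u2126"]

def nfc_codes(ch):
    """Return the code points of the NFC form of `ch` as a list of 'U+XXXX' strings."""
    return [f"U+{ord(c):04X}" for c in unicodedata.normalize("NFC", ch)]

for ch in CANDIDATES:
    source = f"U+{ord(ch):04X}"
    target = nfc_codes(ch)
    status = "unchanged" if target == [source] else "replaced"
    print(source, unicodedata.name(ch), "->", " ".join(target), status)
```

In this sketch the lira sign should come out unchanged, which matches it being the only entry in the list without an asterisk.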
  • Wei Huang Posts: 70
    Does anyone have reliable documentation on the localised forms of Cyrillic? 
  • Stefan Peev Posts: 39
    edited July 18
    Wei Huang, I have started documentation of this kind. See Local Fonts, and the local forms by language: Bulgarian Cyrillic Feature Locl, Serbian Cyrillic Feature Locl, Macedonian Cyrillic Feature Locl, Bashkir Cyrillic Feature Locl, Chuvash Cyrillic Feature Locl.






  • John Savard Posts: 91
    The document pointed out by Frode Bo Heiland reminds me of a political issue I stumbled across in listening to music on YouTube. It appears that the Sami are recognized as an indigenous people by Sweden, but Finland refuses to give them the same recognition.

    Given that Finnish, Estonian, and Sami are all very closely related languages, I would think that the Finns do have an excuse for this apparently retrograde political position. Recognizing the Sami as an indigenous people would suggest that the Finns themselves are an indigenous people - as opposed to a civilized people every bit the equal of Swedes, Frenchmen, and so on.

    But they could be recognized as a separate nationality, like Basques or Welshmen, without the Finns having to categorize themselves or anyone else as primitive savages. This would make everyone happy.
  • Josh_Finklea Posts: 26
    So are the combining diacritical marks truly necessary for modern Cyrillic text?

    I recently came across the Lettersoup page on Bulgarian Cyrillic localized forms. Under the "Marks in the Cyrillic Script" section, they say...

    "Some characters in the Cyrillic script need marks but they do not have a Unicode and actually do not exist as precomposed characters."

    Is this true?
  • André G. Isaak Posts: 80
    edited July 18
    Yes, for full coverage of Cyrillic you minimally need a combining dieresis, a combining breve (the Cyrillic-looking kind), a combining macron, and a combining acute. I'm not sure about grave. Double-acute, double-grave, and inverted breve *might* be needed for Serbian poetics but not for actual day-to-day use (they're used in the Latin alphabet for this purpose, but I'm not 100% sure if they are used in Cyrillic).
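As a partial cross-check of that list, one can ask the Unicode database which combining marks appear in the canonical decompositions of the precomposed letters in the Cyrillic block. This is a sketch with the standard `unicodedata` module; `marks_in_block` is a hypothetical helper. It only covers encoded precomposed letters, so marks that are needed solely for unencoded combinations will not show up.

```python
import unicodedata

def marks_in_block(first=0x0400, last=0x04FF):
    """Map each combining mark to the precomposed letters in the block that use it."""
    marks = {}
    for cp in range(first, last + 1):
        decomp = unicodedata.decomposition(chr(cp))
        # Skip characters without a decomposition and compatibility ones ("<tag> ...").
        if not decomp or decomp.startswith("<"):
            continue
        for part in decomp.split()[1:]:  # first field is the base letter
            mark = chr(int(part, 16))
            if unicodedata.combining(mark):
                marks.setdefault(mark, []).append(chr(cp))
    return marks

for mark, letters in sorted(marks_in_block().items()):
    print(f"U+{ord(mark):04X}", unicodedata.name(mark), "".join(letters))
```

If precomposed glyphs are built as composites, the marks reported here are needed in the font anyway, which may reduce the extra work of also exposing them as combining characters.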
  • George Thomas Posts: 382
    The Cyrillic set I use has all the usual Latin combining accents with the exception of circumflex, dot, ogonek and ring. I devised the set by referring to a variety of sources and as much as I could, determined they were valid. Most are unencoded and likely of use primarily in loanwords and transliteration. The time spent including them is so minimal I can't find a valid reason to omit them.
  • Sorry. I forgot to mention combining caron, which is also used.

    André
  • Josh_Finklea Posts: 26
    George Thomas said:
    “The time spent including them is so minimal I can't find a valid reason to omit them.”

    The main reason I'm asking is because it will only add minimal time for someone who uses anchors, but I don't usually build accented glyphs with anchors, meaning it will add a bit of time to include any combining marks... 