How to implement Bulgarian glyphs with grave that have no unicode (А̀ О̀ У̀ … а̀ о̀ у̀ …)?

Martin Wenzel
Martin Wenzel Posts: 48
edited June 2021 in Type Design Critiques
Hi, I’m looking for a solution to add Bulgarian glyphs with grave that don’t have a Unicode code point, such as ъ̀ (ъ with grave) or Я̀ (Я with grave). What’s the Bulgarian way of solving this; what does a real-life solution look like? The most obvious *technical* solution is to combine “Я” and “Combining Grave Accent”, but is that the way Bulgarians deal with this?
I thought the article by Lettersoup https://www.lettersoup.de/what-shall-be-done-for-bulgarian-cyrillic-loclbgr/ was insightful but I'm not sure what the state of affairs is now, 5 years later.
I've seen one solution where the Bulgarian “ъ̀” (Cyrillic hard sign U+044A with grave) is simply placed on “њ” (Cyrillic nje, U+045A), but that badly messes with the Unicode standard, which we really do not want to do.

Comments


  • The most obvious *technical* solution is to combine “Я” and “Combining Grave Accent” but is that the way Bulgarians deal with this?
    Define anchors for all base characters and all combining accents. That's the way provided by Unicode. Even if a precomposed code point such as Ä exists in Unicode, you should still support A + combining dieresis, because text may be encoded in decomposed form.
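To see which base + grave pairs actually have a precomposed code point, one can ask a Unicode library to compose them. The following is a minimal sketch using Python's standard `unicodedata` module; the helper name `has_precomposed` and the vowel list are only illustrative, not something from the thread.

```python
import unicodedata

GRAVE = "\u0300"  # COMBINING GRAVE ACCENT

def has_precomposed(base: str, mark: str = GRAVE) -> bool:
    """Return True if NFC composes base + mark into a single code point."""
    return len(unicodedata.normalize("NFC", base + mark)) == 1

# Latin A + combining dieresis composes to U+00C4, and decomposes back again.
assert unicodedata.normalize("NFC", "A\u0308") == "\u00c4"
assert unicodedata.normalize("NFD", "\u00c4") == "A\u0308"

# Check the Bulgarian vowels: which of them compose with the grave?
vowels = "аъоуеиюя"
composable = [v for v in vowels if has_precomposed(v)]
print(composable)
```

Everything that does not appear in `composable` has to be handled as a base glyph plus a positioned combining mark.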
  • Thanks @Helmut Wollmersdorfer for your message, the anchors are all present and correct. The question was more one of usability: How is a Bulgarian letter such as “ъ̀” accessed by the users in Bulgaria. What is the going method? Just curious to hear if other font developers have experience with this issue as well.
  • Helmut Wollmersdorfer
    edited June 2021
    How is a Bulgarian letter such as “ъ̀” accessed by the users in Bulgaria. 
    That's a problem of the input method (keyboard configuration) or the application creating the text. 

    On macOS, the Bulgarian Phonetic keyboard has a dead key for the grave accent, but it then produces
    $ uni identify њ
         cpoint  dec   utf-8  html    name
    'њ'  U+045A  1114  d1 9a  &njcy;  CYRILLIC SMALL LETTER NJE (Lowercase_Letter)

    Other Bulgarian keyboards have no grave accent.

    It's up to the user to solve this. E.g. for historic English, German or Latin I need my own customised keyboards for historic characters (long_s, rotunda, combining e above). Or I keep a collection of rarely used characters in a file and use copy & paste.

    The only feature you can support on font level in this case is a stylistic set (default off) to substitute 
    'њ' by the glyph of

    $ uni identify ъ̀
         cpoint  dec   utf-8  html      name
    'ъ'  U+044A  1098  d1 8a  &hardcy;  CYRILLIC SMALL LETTER HARD SIGN (Lowercase_Letter)
    '◌̀'  U+0300  768   cc 80  &#x300;   COMBINING GRAVE ACCENT (Nonspacing_Mark)

    A solution at the font level is always ugly and needs documentation.
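For readers without the `uni` command-line tool, roughly the same information can be listed with Python's standard library. This is only a sketch: the function name `identify` is made up here, and Python reports the short category codes (`Ll`, `Mn`) rather than the long names that `uni` prints.

```python
import unicodedata

def identify(text: str) -> list:
    """List (code point, decimal, UTF-8 bytes, name, category) per character."""
    rows = []
    for ch in text:
        rows.append((
            f"U+{ord(ch):04X}",
            ord(ch),
            ch.encode("utf-8").hex(" "),  # bytes.hex(sep) needs Python 3.8+
            unicodedata.name(ch, "<unnamed>"),
            unicodedata.category(ch),
        ))
    return rows

# ъ followed by the combining grave: two code points, not one.
for row in identify("\u044a\u0300"):
    print(*row, sep="  ")
```

Running it on text copied from a document quickly shows whether a user typed the hard sign plus a combining grave or the Serbian nje.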

  • Seems the forum software is broken and does not format code correctly.
  • John Savard
    John Savard Posts: 1,135
    At first, I didn't quite understand your question, but once you mentioned the Serbian character for palatalized N, I see what your concern is. Not with input methods, per se, which are, as noted, no concern of a font designer, but with the possibility that in Bulgaria, it is typical for people to use non-Unicode fonts because Unicode doesn't meet their needs.
    I know this was, in fact, the situation with the Burmese language several years ago, because computers with the ability to handle Unicode properly were too expensive for people there, so non-Unicode character sets with a full repertoire of pre-composed glyphs were what was normally used. (Not from personal experience, I just remember happening to read it somewhere.)
    I'm not aware of the situation in Bulgaria at all, but by this time, long after the 8-bit era has passed, I strongly doubt that something like this would be going on there.
  • Speaking as a graphic designer (not a type designer) who sometimes has to cope with unusual foreign accented characters that are not on my keyboard, I have found the most expedient solution is a little program provided on all Windows systems. It is called Character Map and gives access to all characters available in a font.
    The accented character is then made by typing the base character, selecting the combining accent in Character Map, copying it, and pasting it into the publication directly after the character you want to accent.
    I don't know if there is an equivalent program for Mac but I suspect there must be.
    If the font used does not have the correct anchors to position the accent correctly then I would seriously consider using a different font unless the client insists on that specific font, then we have a problem.
  • Igor Freiberger
    Igor Freiberger Posts: 280
    edited June 2021
    Where did you find information stating that Bulgarian uses letters with grave accent? Except for Ѝѝ, which are encoded, I see no other letter with grave in my sources. Some of them:
    The internal database of macOS' FontBook also doesn't include other characters.

    Independently of this, combinations not included in Unicode should be built from a base letter followed by a combining diacritic.* You don't need to add the composite to your font, just the letter and the diacritic. But the font needs proper <mark> and <mkmk> features to provide correct diacritic positioning.**

    How these combinations will be typed is up to the OS and its keyboard layouts. Although you can include some substitution rule in the font, this is hardly needed for well-established languages such as Bulgarian. Special input methods are used for languages with poor or no support in the OS.

    * This is the Unicode approach, but only for diacritics that can be positioned above or below the base letter. Combinations with diacritics crossing the base letter are eligible for encoding.

    ** I'm not adding explanation about them because there are lots of very good stuff about <mark> and <mkmk> on the web.
  • John Savard
    John Savard Posts: 1,135
    edited June 2021
    Where did you find information stating that Bulgarian uses letters with grave accent?

    Note that he depicted the grave accent on the right-hand side of the letter, like the mark that turns the Ukrainian letter H into a G.
    I downloaded a Bulgarian font (Veleka, by Stevan Peev), and in addition to different forms for lower-case b, v, and zh, it did change the form of a number of Serbian letters. It seems that the Serbian combined N and soft sign became an N followed by a J.
    So maybe the J is the same as his grave accent.
  • KP Mawhood
    KP Mawhood Posts: 296
    edited June 2021
    @Igor Freiberger It's on wiki and r12a's character usage app.

    @Martin Wenzel The best route will be speaking with writers/readers of Bulgarian directly, but there are a few hacks to get an impression of how Bulgarians may be dealing with this.

    For instance:
    • Use words/phrases with the accents as search engine terms, e.g. "въ̀лна", and consider employing a regex to match a character (and spot potential substitutes). To narrow the results, specify Bulgaria as the region in search terms, or ".bg" as the country code top-level domain (ccTLD), e.g. https://bg.wikipedia.org/
    • Search for recently published PDF files (books, newspapers, linguistic articles) where you expect the accent to appear and review the font inventory of the PDF. This should give a breakdown of the characters used, and whether these are mapped to Unicode or not.
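The regex idea from the first suggestion above could look something like the following sketch in Python. It decomposes the text first, then reports the script of every letter that carries a grave, so that a Latin "à" pasted into a Cyrillic word shows up as a substitute. The function name `grave_bases` is hypothetical, and taking the first word of the Unicode character name as the "script" is a rough assumption, not a proper script lookup.

```python
import re
import unicodedata

GRAVE = "\u0300"  # COMBINING GRAVE ACCENT

def grave_bases(text: str) -> list:
    """Return (base letter, script) for every letter directly followed by a grave."""
    # NFD splits precomposed letters (ѝ, Latin à) into base + combining mark.
    decomposed = unicodedata.normalize("NFD", text)
    found = []
    for match in re.finditer(r"(\w)" + GRAVE, decomposed):
        base = match.group(1)
        # Rough script guess: first word of the character name (assumption).
        script = unicodedata.name(base, "UNKNOWN").split()[0]
        found.append((base, script))
    return found

print(grave_bases("в\u044a\u0300лна"))  # Cyrillic hard sign + combining grave
print(grave_bases("пар\u00e0"))         # Latin à used as a stand-in
```

A result tagged `LATIN` inside an otherwise Cyrillic word is a hint that the writer worked around a missing glyph or keyboard key.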
  • John Savard
    John Savard Posts: 1,135
    edited June 2021
    It turns out I was mistaken; the glyphs for Cyrillic characters outside of Russian are the same as in non-Bulgarian fonts like Courier New. Not realizing where the Serbian codepoints were and not being that familiar with the Unicode Cyrillic repertoire led me to the error. However, there is an e with a grave accent as well as the Cyrillic i.


    From that wiki, I see that the grave accents are used to indicate stressed syllables, just as acute accents are used for that purpose in Russian.
    In that case, I suppose one could fake it by replacing acute accents by grave ones in the Cyrillic range for a Bulgarian font!

  • Mark Simonson
    Mark Simonson Posts: 1,739
    Depending on the application being used by your end users, one solution for supporting characters that don't (yet) have Unicode code points is to add support for <mark> and <mkmk>. This allows you to attach arbitrary accents to arbitrary letters (e.g., the n with umlaut in "Spin̈al Tap", which will probably never have a code point).
  • John Hudson
    John Hudson Posts: 3,229
    [OFF TOPIC]

    I know this was, in fact, the situation with the Burmese language several years ago, because computers with the ability to handle Unicode properly were too expensive for people there, so non-Unicode character sets with a full repertoire of pre-composed glyphs were what was normally used.

    The situation with Burmese wasn’t to do with ‘computers with the ability to handle Unicode properly were too expensive for people there’. The problem was that there were no computers that handled Burmese OpenType shaping properly. The embargo against the military regime made it impossible/unattractive for foreign companies to do business in Myanmar, which led to very slow implementation of Myanmar script shaping in operating systems, leading to the development of a local hack encoding/shaping model to fill the gap.
  • John Savard
    John Savard Posts: 1,135
    @John Hudson
    Thank you for the information. My source for this just mentioned the inequity that the languages of "economic importance" had precomposed glyphs as characters, and the other ones didn't, without giving any other reason.
    I know the Unicode consortium was against precomposed glyphs except where they appeared in existing computer or communications standards, which would have that result.
    One problem with this is that this is likely to prevent people using their own language for things like file names or variable names in programs, because the pre-composed glyph character is the only representation that is sufficiently unique to be considered safe.
  • John Hudson
    John Hudson Posts: 3,229
    There are plenty of writing systems—including Burmese—that cannot be reasonably handled by encoding precomposed character combinations. The Zawgyi hack encoding a) produces a kind of simplified Burmese layout in which some combinations look sort of okay and many do not, with no method to be able to refine the output, and b) only works for Burmese language and not for the numerous other languages using the script.

    One problem with this is that this is likely to prevent people using their own language for things like file names or variable names in programs, because the pre-composed glyph character is the only representation that is sufficiently unique to be considered safe.

    This is (one of the reasons) why normalisation exists and is part of Unicode.
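A small sketch of what normalisation buys you for things like file names, using Python's standard `unicodedata` module. The helper name `canonical` and the example file names are made up for illustration.

```python
import unicodedata

def canonical(name: str) -> str:
    """Normalize to NFC so canonically equivalent spellings compare equal."""
    return unicodedata.normalize("NFC", name)

# ѝ typed as one code point vs. as и + combining grave: different strings...
precomposed = "\u045d.txt"
decomposed = "\u0438\u0300.txt"
assert precomposed != decomposed
# ...but identical after normalisation.
assert canonical(precomposed) == canonical(decomposed)

# ъ̀ has no precomposed form, yet its normalised form is still unique:
# two combining marks typed in either order end up in one canonical order.
order_a = "\u044a\u0300\u0323"  # ъ + grave + dot below
order_b = "\u044a\u0323\u0300"  # ъ + dot below + grave
assert canonical(order_a) == canonical(order_b)
```

So a sequence without a precomposed code point is no less safe as an identifier, provided both sides of a comparison are normalised to the same form.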
  • Vasil Stanev
    Vasil Stanev Posts: 775
    edited June 2021
    The grave accent is used to distinguish homophones in our language, like вЪлна and вълнА. It is almost never used outside of dictionaries. The female pronoun ѝ is assigned to shift+x (ex) on my keyboard and is sometimes included in fonts because it is much more common. The coding matters are discussed above. Lack of tech support or other reasons are why ѝ is sometimes substituted with й (sounds like the 'y' in 'yeast'), but this is not grammatical.

  • Firstly, thank you very much to KP Mawhood for the references and Denis Jacquerye for the observations.

    I am not sure if we should consider these other vowels with grave as part of the needed characters to support Bulgarian. Let me explain why.

    The additional vowel+grave combinations are used only to indicate stress in dictionaries and linguistic studies. "Бъ̀лгарската а̀збука" is an example: Wikipedia articles in Cyrillic-script languages usually mark stress at the first appearance of the title. It's a perfect parallel to articles in Latin-script languages, which do that using IPA.

    In other words, additional vowels with grave in Bulgarian are similar to IPA symbols in other languages: not part of the everyday written language, but needed for meta-information. This is reminiscent of Latin, which uses only the basic Latin alphabet, yet any dictionary or linguistic study of Latin will need vowels with macrons. Are these part of the Latin alphabet or not?

    Thus, the question is what we mean by language support.

    Another example: any phonological study about English will use IPA symbols. But these symbols are not considered part of the English alphabet. In the same way, Éé, Ïï, and Çç are needed to write fiancé, naïve, and façade, words included in the English lexicon. But are Éé, Ïï, and Çç actually needed for English? AFAIK, no, because these words can be correctly written without diacritics. But if I am preparing an English dictionary, Éé, Ïï, Çç, and several other characters outside the regular English alphabet will be needed. So, where to draw the line?

    I have a database of scripts, languages, and orthographies to identify which languages a font supports. For now, I restrict the "language support" to the current use, excluding meta information. It's also the criterion used by Apple (Font Book).

    Finally, regarding Bulgarian, I trust the Local Fonts site as a primary source because its author, Stefan Peev, is Bulgarian and a type designer who knows well the different Cyrillic orthographies.
  • @Igor Freiberger I think I agree with what you are saying, but I think it could be a bit more nuanced. If language support is tiered, then standard Bulgarian orthography doesn't need letters with grave except for ѝ. However, I would make a distinction between educative Bulgarian orthography and phonetic transcription. The IPA is its own writing system, whereas educative Bulgarian orthography indicating stress is an extension of the standard Bulgarian orthography.

    So for any orthography, it would make sense to have requirements for the standard orthography, requirements for educative or poetic orthography if it exists, and requirements for borrowed words (like the examples you give for English) if they occur. It would be fair to say a font supports a language if it supports the first, but covering the other two would allow better flexibility for a variety of uses.
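A tiered notion of language support like this could be sketched in code as follows. The tier names, the function `support_tier`, and the choice to represent the educative tier simply by the presence of U+0300 are all assumptions made for illustration; only two of the three tiers are modelled, and a real check would also need to confirm that the font positions the mark correctly, which an encoding check cannot show.

```python
BULGARIAN_LOWER = "абвгдежзийклмнопрстуфхцчшщъьюя"  # the 30 letters of the alphabet

# Tier 1: standard orthography, including the precomposed Ѝ and ѝ.
STANDARD = set(BULGARIAN_LOWER) | set(BULGARIAN_LOWER.upper()) | {"\u040d", "\u045d"}
# Tier 2 (assumption): stress-marked vowels are composed with U+0300.
STRESS_MARKED = {"\u0300"}

def support_tier(encoded: set) -> str:
    """Classify a set of encoded characters into an illustrative support tier."""
    if not STANDARD <= encoded:
        return "unsupported"
    if STRESS_MARKED <= encoded:
        return "standard + stress marks"
    return "standard"

print(support_tier(STANDARD))
print(support_tier(STANDARD | STRESS_MARKED))
```

In practice `encoded` would be filled from a font's character map, and a third set could be added for characters needed in borrowed words.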

  • Dear all, thank you all very much for your input on this matter. My understanding was as well that the Cyrillic а̀ о̀ у̀ (and others) in Bulgarian are only used in dictionaries, or when the language is taught, but the client emphasized that this is not true and that these glyphs are indeed needed (one of the examples they gave: “парà” means money, “пàра” means steam).
    I think I will go with the <mark> solution to allow the users to compose those glyphs plus precomposed versions of these glyphs (“ligatures” if you will), which I've seen in a font by Bulgarian designer Vassil Kateliev.
    A great overview of the requirements of Bulgarian (and 655 other languages!) is to be found on RosettaType’s Hyperglot https://hyperglot.rosettatype.com/

  • Stefan Peev
    Stefan Peev Posts: 103
    Martin Wenzel said:
    My understanding was as well that the Cyrillic а̀ о̀ у̀ (and others) in Bulgarian are only used in dictionaries, or when the language is taught but the client emphasized that this is not true and that these glyphs are indeed needed (one of the examples they gave: “парà” means money, “пàра” means steam).

    Is your client Bulgarian? If so, his knowledge of the norms of the Bulgarian literary language is strange. In the "Official Spelling Dictionary of the Bulgarian Literary Language" (pages 104-105) it is clearly written that the grave [`] is used in the following cases:
    - on vowels in individual words to avoid ambiguity [сèдмица (week) vs. седмѝца (seven)]
    - on vowels in the particle [по] for grading nouns, verbs and prepositional combinations [пò юнак, пò обичам, пò към тебе, още пò на запад]
    - in case of transcribed proper names from foreign languages, if it is necessary to indicate the place of the stress [Мàртин and Мартѝн, Àгата and Агàта, Ивàнов and Иванòв]
    - in specialized publications such as dictionaries, reference books, encyclopedias [кàжа, кàжеш, кàжем; кàзах, кàза; кàжех, кàжеше; кàзал; кàжел; кàзан; кажѝ!, кажèте!]
    - in the short forms of the personal and possessive pronouns for the third person feminine singular [Трябва да ѝ изпратиш поздравления. Книгата ѝ лежеше на масата.]
    However, all these cases do not refer to the usual practice of using the Bulgarian literary language. In other words, accented vowels are rarely used. Only the accented letters Ѝ (uni040D) and ѝ (uni045D) are used extremely often, but they have separate unicode numbers.
  • John Hudson
    John Hudson Posts: 3,229
    - on vowels in individual words to avoid ambiguity [сèдмица (week) vs. седмѝца (seven)]

    That is precisely the use case illustrated by the парà/пàра example provided by Martin’s client. Because such distinctions are not always marked in Bulgarian text does not mean that they do not sometimes need to be marked, and the fact that use cases for the grave mark are explicitly described in the official spelling dictionary indicates that these cases should be supported in fonts for Bulgarian text.
  • Another useful resource is CLDR data for Bulgarian.
  • Btw, I think "Font technology" would have a better categorization for this thread rather than "Type Design Critiques".
  • Btw, I think "Font technology" would have a better categorization for this thread rather than "Type Design Critiques".

    I must have clicked wrong. Not sure, can an administrator of typedrawers.com change the categorization?
  • It so happens that Bulgarian uses grave while Russian uses acute, for just the same purpose. The acuted Cyrillic vowels also don't exist as precomposed Unicode codepoints. They're used to indicate stress in schoolbooks or Russian-as-a-foreign-language books. 
  • Adam Twardoch
    Adam Twardoch Posts: 515
    edited August 2021
    Overall, many orthographies offer »pronunciation helpers«. Hebrew and Arabic have the small vowel marks and/or cantillation marks, and the practice of including them in »pro« fonts is well-established. But for the Cyrillic stress marks (acute & grave), it's still »mysterious«. 
  • Adam Twardoch
    Adam Twardoch Posts: 515
    edited August 2021
    The problem is with OS vendors. We effectively have 4 platforms (Android, iOS, Windows, macOS), and there are now fancy emoji input methods in all of them — but the keyboard layouts they ship for traditional text input are still 30 years old.

    None of these platforms includes combining marks in the standard keyboard layouts for most languages, and many fail to even include proper punctuation (as in emdashes on Windows). Even touch keyboards with flyouts à la iOS include absurd subsets of accented letters: č or ž exists in iOS English keyboards but ý doesn't.

    And the funny thing is: keyboard layouts are easy to create, they're TINY and OSes can easily ship multiple layouts per language. 

    I think the type community would benefit from teaming up with linguists to create a set of »Typo« layouts for major languages, esp. the LCG ones (because those are the ones that lag behind the most). 

    Make a common repository, submit them to CLDR/Unicode, produce the layouts in various formats, and submit them for inclusion in various OSes. 

    There are even 3rd-party keyboard vendors for mobile apps (I use Google’s Gboard on iOS). Those vendors compete, and they have a business case (data collection, machine learning for better text input prediction). I wouldn't be surprised if one of these vendors would agree to sponsor such a project. 
  • John Savard
    John Savard Posts: 1,135
    edited August 2021
    None of these platforms includes combining marks in the standard keyboard layouts for most languages, and many fail to even include proper punctuation (as in emdashes on Windows).
    This deserves a brief comment.

    Obviously, a French keyboard had better include a way to enter all the accented letters used for French. AFAIK, most do accomplish that - but only for lower-case. And so the French government, as I noted upthread, started an initiative to correct that.

    When it comes to the em-dash being part of essential punctuation for English, though, that's what I felt I needed to comment on.
    Computer keyboards started out as upper-case only, with characters taken from the typewriter keyboard. Perhaps a few characters relevant for computing were added.
    Now that we have ASCII with lower-case as the standard on which the English keyboard is based, we have a few additional characters, but nonetheless, the computer is not envisaged as a typesetting device (even if, thanks to the laser printer, it most definitely can be used as one). So we don't even have the characters found on, say, the IBM Selectric Composer, such as the dagger and directed quotes.
    Having, in addition to the data processing keyboard layout, a word processing layout (in which things like paragraph, section, degree, and plus/minus would be added), and a typesetting layout for each language would be a good idea. Making the typesetting layout the default, however, and putting it on the keycaps would be... intimidating.