The Mysteries of the Unicode Table

Ray Larabie · January 2016

Rather than start a new thread every time I have a question about some obscure glyphs, I figured it'd be better to create a new thread as a catch-all. Feel free to chime in with your own questions.

This isn't about whether or not certain glyphs are valid or worth bothering with. It's about knowing what they're used for so you can make a better judgment call as to whether or not to include them. For example, if you're making a text font that might be used in a dictionary, you might want to include IPA characters. If you're making a display font, you might want to leave those out.

Anyway, today, I'm fooling around with Latin Extended Additional and I got to the 1E80-1E85 range. ẀẁẂẃẄẅ
I've been including these glyphs for about a decade simply because someone emailed me and told me that they're needed for Irish... or was it Welsh? It's been a while; I can't remember. Anyway, I looked them up today and I can't figure out who, if anyone, actually uses these. Any ideas about these accented W's?
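
For anyone who wants to poke at this range directly, here is a minimal Python sketch that relies only on the standard `unicodedata` module. The helper name `describe_range` is made up for illustration. It lists each character in 1E80-1E85 with its Unicode name and canonical decomposition, which shows that each one is just W or w plus a combining grave, acute, or diaeresis. It says nothing about which languages use them.

```python
import unicodedata

def describe_range(start, end):
    """Return (code point, character, name, decomposition) for each
    code point from start to end inclusive. Hypothetical helper."""
    rows = []
    for cp in range(start, end + 1):
        ch = chr(cp)
        rows.append((
            f"U+{cp:04X}",
            ch,
            unicodedata.name(ch),
            unicodedata.decomposition(ch),  # e.g. base letter + combining mark
        ))
    return rows

rows = describe_range(0x1E80, 0x1E85)
for row in rows:
    print(*row, sep="  ")
```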

John Hudson · January 2016

Welsh. Also ŵ. But relatively rare.

http://www.200words-a-day.com/typing-welsh-characters.html

Ray Larabie · January 2016

Thanks. Here it is down in the section on diacritics: en.wikipedia.org/wiki/Welsh_orthography

Belleve Invis · January 2016

There's even something more interesting. For example, do you know Unicode has two half o's? That's U+1D16 & U+1D17 (ᴖ and ᴗ). I am really curious about who uses them.

Fred Wilson · January 2016

Ray, do you already have this link in your collection?

https://fr.wikipedia.org/wiki/Diacritiques_de_l'alphabet_latin#Tableau_r.C3.A9capitulatif

It's in French, but seems pretty easy to understand (at least for the table part). Unfortunately, I didn't find an English equivalent.

Georg Seifert · January 2016

Wasn’t decodeunicode.org meant to document this kind of stuff?

Georg Seifert · January 2016

And I usually look at: www.eki.ee/letter/

Max Phillips · January 2016

If one adds the tironian et (⁊), should one also add ḃ ċ ḋ ḟ ġ ṁ ṗ ṡ ṫ? These letters are commonly transliterated to letter+h in modern Irish orthography.

I think ḃ ċ ḋ ḟ ġ ṁ ṗ ṡ ṫ would only be useful for scholarly work. The Irish government mandated their replacement by a boatload of initial h's sometime in the 1960s, and I believe pretty much all daily readers of Irish are used to reading this way.

As for the Tironian et, I don't recall ever seeing one in my 2 1/2 years here. FWIW, Irish road signs use a lot of archaic forms. They're all set in a proprietary version of MOT Transport.

John Hudson · January 2016

I think ḃ ċ ḋ ḟ ġ ṁ ṗ ṡ ṫ would only be useful for scholarly work. The Irish government mandated their replacement by a boatload of initial h's sometime in the 1960s, and I believe pretty much all daily readers of Irish are used to reading this way.

My understanding from Michael Everson is that both the traditional and reformed orthographies have official status, although the former is limited in contemporary use and mostly found only when the traditional script form is also used.

John Hudson · January 2016

There's even something more interesting. For example, do you know Unicode has two half o's? That's U+1D16 & U+1D17 (ᴖ and ᴗ). I am really curious about who uses them.

Uralicists. These characters are not, to my knowledge, used in any natural language orthographies. They are part of a system of the Uralic Phonetic Alphabet, used by linguists primarily in reconstructions of proto-Finno-Ugric and -Samoyed languages.

James Puckett · January 2016

Georg Seifert said:

Wasn’t decodeunicode.org meant to document this kind of stuff?

I thought that was what scriptsource.org is for. But those are places to post answers, this can be the place to ask the questions.

Dan Reynolds · January 2016

What is so strange about the Tironian Et?

Ray Larabie · January 2016

Ray, do you already have this link in your collection?

I've encountered it but I usually get to the same destination by Googling the glyph + wiki.

www.eki.ee/letter/ is useful for more common glyphs. For example: Latin Extended B hasn't been filled in much. The Serer language uses Ƈ (0187), but no languages are listed here. Over a million people speak a language that uses that glyph but it's hard to determine that from any of the links mentioned.

Very often, the French wikipedia pages have more detailed explanations of alphabets. For example: Serer French vs Serer English

One of the more useful resources I've been using is Omniglot.

Denis Moyogo Jacquerye · January 2016

Unicode’s Latin extensions are a mixed bag of things. Many IPA characters are needed for some language orthographies and have uppercase forms in the Latin extensions, and there are also a couple that don’t have uppercase. So if broad language support is part of the scope of your project, don’t leave them out completely.

Ẁ is used in Welsh, Ẃ in Lower Sorabian, they could also be used in some tonal languages. Ẅ is used in a few Cameroon languages.

Serer has used Ƈ for a while. That was also made official in a Senegalese decree from 2005.

Denis Moyogo Jacquerye · January 2016

@Frode Bo Helland You’re right, sorry. It was used in Lower Sorbian.

Igor Freiberger · January 2016

Regarding Wikipedia, it is mostly a decent source of information about languages and alphabets. But many times the English version is not the best one, as Ray already pointed out.

A good criterion is to check the colonial language, like Spanish for Central and South America, French for Western Africa and the Pacific Islands, Dutch for the Caribbean islands and so on. Another criterion is to check the German version, as Germans have a long and deep tradition of linguistic studies. And, surprisingly, some Russian pages are quite good on Central and East European languages.

An example: the extinct Polabian language has a good page in English, but the German page is better and the Russian one is impressive.

I already got caught in a search error, kindly corrected by Nicolas Silva here in TypeDrawers: the Brazilian material I read about Guarani did not mention G̃, but the studies published in Spanish are more complete and include it in the alphabet. As Guarani is limited to very few regions of Brazil, but widely used in Paraguay, Bolivia and northern Argentina, I should suppose the best source wouldn't be in Portuguese.

A lesser-known way to get info is to read the proposals submitted to Unicode. These documents are usually quite informative, with samples of use and historical background, which may be extremely relevant when deciding how to design the characters. SIL has a page with their proposals, but most of them are spread over the web.

Ray Larabie · February 2016

Does anyone have any tips for lowercase chi? I can get a handle on most of the Greek alphabet but that one letter eludes me. I went through the docs on Gerry Leonidas' site. But I still can't figure chi out. What other letters does the stroke relate to? What other letters can I use for reference for the angle or width? Pictured below is Corbel's chi.

Image: https://us.v-cdn.net/5019405/uploads/editor/92/6o9nyafoz7od.png

Richard Fink · February 2016

Ray,

fileformat.info is a really sweet site that will give you a list, with links, of many fonts that support the Unicode point you're looking for.
http://www.fileformat.info/info/unicode/char/03a7/fontsupport.htm
It gives a list of at least a hundred fonts that have chi glyphs. Many of them available for free download.

Rich

Ray Larabie · February 2016

I know what a chi looks like, I just don't get how it relates to the rest of the Greek lowercase. Like sometimes it's drawn like a lowercase x that reaches the descender line. But that type of chi design seems to be restricted to designs with heavy Latinization such as ones with the eta and upsilon with no leading curl. In Corbel, the chi stroke goes past the left sidebearing... and it looks correct to me. The chi seems to relate to the lambda but has hardly any resemblance to a lowercase x. The stroke gets close to the eta's descender and I'm guessing that's how you're supposed to determine how far past the left sidebearing it can go. Does anyone have any good chi strategies?

Igor Freiberger · February 2016

Ray: I would make lowercase chi coherent with lowercase lambda.

Frode: In my searches, I found several fonts using the bar crossing the G descender also when it is a one-bowl g. And none using it at the upper position. So I keep the position regardless of whether it is a one- or two-bowl G.

Regarding Kadiwéu, the language was documented for the first time in 1977 by a couple of SIL linguists. They used a hyphen over G as this is easily achieved with typewriters. So, the Kadiwéu variant may be as in your right sample; this is also the way I built it. Publications with Kadiwéu are very rare and, due to lack of fonts, no one used the barred-G except those made with typewriters.

Ray Larabie · February 2016

@Frode Bo Helland
There are no other stroked letters in that alphabet so it doesn't have to match anything. Personally, if my typeface didn't already have a hooked g, I'd combine q, dotless j and put the stroke under the bowl. I don't think it can look attractive with a binocular g, except in lighter weights. A bold Ǥ is problematic if your G has a horizontal stroke. Once I dealt with this by removing the horizontal stroke, extending the horizontal stroke and adding the bar.

Some glyphs were created to punish type designers.

Ray Larabie · March 2016

Is combining grapheme joiner (034F) the same as zero width non joiner (200C) and zero width space (200B)? As far as font construction goes, are all three of these simply zero width blanks?

Richard Fink · March 2016

Ray Larabie said:

Is combining grapheme joiner (034F) the same as zero width non joiner (200C) and zero width space (200B)? As far as font construction goes, are all three of these simply zero width blanks?

Ray, I sent in a proposal for a presentation at TypeCon in Seattle this year titled "Empty Space Characters In Modern Character Sets" and I was unaware of the grapheme joiner 034F until your post, so thanks. (Gotta find out what that's about.)
But I do have a partial answer to your question if I understand it correctly.

In browsers, at least, the behavior described in the Wikipedia entry for Zero-Width Space is accurate and you can even test that behavior right there on the Wikipedia page itself by resizing the browser window.
It says:
"In HTML pages, the zero-width space can be used as a potential line-break in long words as an alternative to the <wbr> element."

And so the zero-width space is not simply empty space. It's more akin to a control character. But one that only kicks in under certain circumstances such as when the viewport is too small to display an unbroken string of text and the character has been inserted at the preferred breakpoints.
AFAIK - it's most useful when you've got a long URI and you don't want it to break in a weird spot in a small viewport. Maybe long place names, too - I'd have to think about it.

And yeah, they are zero width blanks.

SOTA TypeCon presentation evaluators take heed!
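
As a rough illustration of the long-URI case above, here is a small Python sketch. The function name `add_break_opportunities` and the choice to break after each slash are assumptions made for the example, not something the thread prescribes. It inserts U+200B after every slash, so a browser has preferred places to wrap the string, and then checks that stripping the zero-width spaces restores the original text.

```python
ZWSP = "\u200b"  # ZERO WIDTH SPACE

def add_break_opportunities(text, after="/"):
    """Insert a zero-width space after every character found in `after`.
    Hypothetical helper: the break positions are a design choice."""
    out = []
    for ch in text:
        out.append(ch)
        if ch in after:
            out.append(ZWSP)
    return "".join(out)

url = "www.fileformat.info/info/unicode/char/03a7/fontsupport.htm"
marked = add_break_opportunities(url)

# The marked string should look identical but is longer in code points.
print(len(url), len(marked))
# Removing the zero-width spaces gives back the original string.
print(marked.replace(ZWSP, "") == url)
```

One thing to watch for: the inserted characters survive copy and paste, so a URL marked up this way is better kept for display than for anything a user might copy into an address bar.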

John Hudson · March 2016

Is combining grapheme joiner (034F) the same as zero width non joiner (200C) and zero width space (200B)? As far as font construction goes, are all three of these simply zero width blanks?

If a font contains U+034F, then yes, it should probably be zero-width, no-outline, and the same is obviously true for U+200B.

U+200C and U+200D are a little different. They can be no-outline glyphs, but for scripts in which these are used as layout control characters it is helpful to have visual representations for editing purposes. Software like MS Word has an option to display control characters in text, and does so by using glyphs in the font; when this option is disabled (most of the time), display of these glyphs is suppressed. Conventions for display of these and other layout control characters vary, but typically involve a thin vertical bar to make it easy to identify the insertion point in text when displayed, topped by a small symbol indicating the character. And, of course, all zero width. These are the forms I have come to favour, the first few following Microsoft conventions:

Image: https://us.v-cdn.net/5019405/uploads/editor/p5/pf8j3v35zi8s.png

John Hudson · March 2016

BTW, note that the 'combining grapheme joiner' is a bit of an oddity, not least because it doesn't join graphemes. There was a period when it looked like it might be deprecated, but then it was found to be useful for preventing mark reordering during normalisation, which is necessary for Biblical Hebrew.
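
The reordering can be seen with a small Python sketch using the standard `unicodedata` module. The choice of lamed with patah and hiriq is just a convenient example pair, as the two points have different combining classes. Without U+034F, normalisation swaps the two points into canonical order; with U+034F between them, the typed order survives.

```python
import unicodedata

LAMED = "\u05DC"   # HEBREW LETTER LAMED
PATAH = "\u05B7"   # HEBREW POINT PATAH
HIRIQ = "\u05B4"   # HEBREW POINT HIRIQ
CGJ = "\u034F"     # COMBINING GRAPHEME JOINER

def codepoints(s):
    """Format a string as a list of U+XXXX labels for printing."""
    return [f"U+{ord(c):04X}" for c in s]

# CGJ is a nonspacing mark whose combining class is 0, so marks
# cannot be reordered across it during normalisation.
print(unicodedata.category(CGJ), unicodedata.combining(CGJ))
print(unicodedata.combining(PATAH), unicodedata.combining(HIRIQ))

plain = LAMED + PATAH + HIRIQ
guarded = LAMED + PATAH + CGJ + HIRIQ

# Without CGJ the two points are sorted by combining class.
print(codepoints(unicodedata.normalize("NFC", plain)))
# With CGJ the original order is preserved.
print(codepoints(unicodedata.normalize("NFC", guarded)))
```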

Chris Lozos · March 2016

@Ray Larabie,

The chi or lambda or most Greek lowercase are not so tightly controlled a system as Latin glyphs. Greek is more free-flowing and not so dependent on geometry as Latin may appear. Think of it as crafted writing rather than construction and architecture. While the modern, more Latinized Greek fonts do attain more rigidity of construction, they do not shy away from this and embrace it. The more traditional Greek forms for lowercase have more the feel of Matisse gesture drawings with vitality. They flow together rather than fit together. Latinized Greek is more like soldiers marching in step while the more humanized Greek forms are line-dancing in harmony with each other.

Ray Larabie · March 2016

The General Punctuation range (2000-206F) is full of interesting stuff. I think the following glyphs are zero width: 200B, 200C, 200D, 200E, 200F, 2028, 2029, 202A, 202B, 202C, 202D, 202E, 2060, 2061, 2062, 2063, 2064, 2066, 2067, 2068, 2069, 206A, 206B, 206C, 206D, 206E, 206F. 2024 seems to be a period for some primordial Xerox encoding.
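
One way to sanity-check a list like this is to look up each code point's general category with Python's standard `unicodedata` module. This is only a sketch: the category is a character property and says nothing definitive about the advance width a given font should use, so treat it as a hint. The `candidates` list below is just the list above expanded from its ranges, and the helper name is made up.

```python
import unicodedata

# The code points from the list above, expanded from their ranges.
candidates = (
    list(range(0x200B, 0x2010))      # 200B-200F
    + [0x2028, 0x2029]
    + list(range(0x202A, 0x202F))    # 202A-202E
    + list(range(0x2060, 0x2065))    # 2060-2064
    + list(range(0x2066, 0x2070))    # 2066-206F
)

def general_category(cp):
    """Two-letter Unicode general category for a code point."""
    return unicodedata.category(chr(cp))

for cp in candidates:
    # Fall back to a placeholder if this Python's Unicode data lacks a name.
    name = unicodedata.name(chr(cp), "<no name>")
    print(f"U+{cp:04X}", general_category(cp), name)

# 2024 for comparison: it is ordinary punctuation, not a format character.
print("U+2024", general_category(0x2024), unicodedata.name("\u2024"))
```

Most of the list comes back as format characters (Cf), while 2028 and 2029 are the line and paragraph separators, which have their own categories.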

Two more questions:

Spacing Modifier Letters (02B0-02FF) contains the non-zero width accents we usually include in our fonts. For example: ring at 02DA. Apart from using these glyphs in composites, when are these actually used? We already have a combining ring at 030A. Who actually uses the 02DA ring? Do some applications use these as combining accents?

If I'm using regular accents for lowercase and unencoded compact accents for capitals, how do I deal with combining accent substitution? I've got @comb, @combcap and @cap classes so I can check if the preceding glyph is a capital and substitute the alternate combining accent. Should these combining accents be placed at lowercase height so applications will then raise them to cap height? If I place those alternate combining accents at cap level, will applications bump them up too high over the capitals?

Igor Freiberger · March 2016

Ray, the composites you refer to actually use the combining diacritics. Spacing modifier letters are used in phonetics, programming, and also as metainformation (you use a spacing modifier tilde to show the tilde). I suppose there are other uses I am not aware of.
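
The difference between the spacing ring (02DA) and the combining ring (030A) also shows up in the character data. A minimal Python sketch with the standard `unicodedata` module, assuming nothing beyond the two code points mentioned above: the combining ring composes with a preceding a under normalisation, while the spacing ring stays a separate character whose compatibility decomposition is a space plus the combining ring.

```python
import unicodedata

SPACING_RING = "\u02DA"     # RING ABOVE (Spacing Modifier Letters)
COMBINING_RING = "\u030A"   # COMBINING RING ABOVE

# General categories: the spacing ring is a symbol, the combining ring a mark.
print(unicodedata.category(SPACING_RING), unicodedata.category(COMBINING_RING))

# The spacing ring only has a compatibility decomposition.
print(unicodedata.decomposition(SPACING_RING))

# a + combining ring composes to a single precomposed letter under NFC...
composed = unicodedata.normalize("NFC", "a" + COMBINING_RING)
# ...while a + spacing ring is left as two separate characters.
not_composed = unicodedata.normalize("NFC", "a" + SPACING_RING)
print(len(composed), len(not_composed))
```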

Regarding the accent substitution, this is made with ccmp and marks. You define, for example, that á is made by a+acutecomb while Á is made by A+acutecomb.uc.

Richard Fink · March 2016

John Hudson said:

U+200C and U+200D are a little different. They can be no-outline glyphs, but for scripts in which these are used as layout control characters it is helpful to have visual representations for editing purposes. Software like MS Word has an option to display control characters in text, and does so by using glyphs in the font; when this option is disabled (most of the time), display of these glyphs is suppressed. Conventions for display of these and other layout control characters vary, but typically involve a thin vertical bar to make it easy to identify the insertion point in text when displayed, topped by a small symbol indicating the character. And, of course, all zero width. These are the forms I have come to favour, the first few following Microsoft conventions:

Thanks for pointing this out.

John Hudson said:

BTW, note that the 'combining grapheme joiner' is a bit of an oddity, not least because it doesn't join graphemes. There was a period when it looked like it might be deprecated, but then it was found to be useful for preventing mark reordering during normalisation, which is necessary for Biblical Hebrew.

The Wikipedia page for 034F does a good job of explaining. With Hebrew examples, too.

Richard Fink · March 2016

Ray Larabie said:

The General Punctuation range (2000-206F) is full of interesting stuff. I think the following glyphs are zero width: 200B, 200C, 200D, 200E, 200F, 2028, 2029, 202A, 202B, 202C, 202D, 202E, 2060, 2061, 2062, 2063, 2064, 2066, 2067, 2068, 2069, 206A, 206B, 206C, 206D, 206E, 206F. 2024 seems to be a period for some primordial Xerox encoding.

Thanks. Gotta check all these characters out.
Sorry to go off topic, but wasn't Primordial Xerox the name of a heavy metal band from the eighties?

Chris Lozos · March 2016

No, Richard, they were just copies ;-)
