What character set do you usually reach for as a default when you start a new font?

Comments

  • John Hudson
    John Hudson Posts: 3,206
    That's a pretty good combining mark subset, Ray. I'd add the combining cedilla U+0327, because I always use the combining mark glyphs to make composites. And U+0315 is useful for the Slovak 'caron' composites.

    Your notes for U+0337 and U+0338 are misleading, though. U+0337 is a short, mid-height stroke, of the kind one sees on the Polish ł (although that diacritic does not have a decomposition); U+0338 is a long slash, so conceivably usable to make things like both Ø and ø, except those don't have decompositions, and Unicode notes somewhere that overlays are not expected to be rendered with combining marks but are best handled with precomposed glyphs. I don't really consider either of these necessary for the kind of non-specialist and display fonts you're talking about.
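
    A quick way to double-check the decomposition side of this, assuming Python 3 and its standard unicodedata module (the sample characters are just illustrative):

        import unicodedata

        # Letters with a canonical decomposition can be built from a base plus a
        # combining mark, e.g. e + U+0301, s + U+0327, k + U+030C:
        for ch in "é ş ǩ".split():
            print(ch, unicodedata.decomposition(ch))   # e.g. é -> '0065 0301'

        # ł, Ø and ø have no decomposition: the stroke/slash is part of the letter,
        # so these are best handled as precomposed glyphs rather than overlays.
        for ch in "ł Ø ø".split():
            print(ch, repr(unicodedata.decomposition(ch)))  # all print ''
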
  • Kent Lew
    Kent Lew Posts: 944
    @Denis Moyogo Jacquerye Thanks for the information about Gagauz. I had thought, mistakenly perhaps, that the tcedilla in this case suffered from the same kind of early Unicode conflation as the Romanian character (from which I had presumed it had been borrowed).

    I was aware of the legacy issues with Romanian et al., and there are reasonable debates to be had about the inclusion of the old codepoints for these purposes. I do agree that if they are to be included, then they should be drawn as such, with proper cedillas, to harmonize with scedilla.

    And I have heard Maxim say the same thing about modern use of Yat’ et al. That is why I put “historic” in quotation marks.

  • Kent Lew
    Kent Lew Posts: 944
    And U+0315 is useful for the Slovak 'caron' composites.
    John, I’d never considered that one. Isn’t the Slovak preference not to have this form of haček look like a comma or apostrophe? You wouldn’t use U+0315 for actual decomposition, would you? Or am I taking the Unicode name too literally?

  • ... I think this topic is very important. I imagine there are a lot of new type designers who are curious about extending Latin language coverage.
    I'll be quoting you on that. Not only a lot of new type designers and designers with little formal training, but also web developers, and who knows who else who wants to understand what they're getting for their money when they license a font.
    Certainly there are languages with large populations of speakers where we know with certainty what characters are needed. Past that point, it's like a new frontier.  But there is absolutely no reason why, today, it needs to be that way. Every language has a Wikipedia page. And that's usually just the beginning. Thirty years ago it must have been very very difficult to become language-savvy in a big way. I give the scholars and researchers who had to do without the Internet a lot of credit.
    One thing web fonts have surely done: they've really put the "World" into the phrase "World Wide Web".  Multilingual support is more and more expected.
  • John Hudson
    John Hudson Posts: 3,206
    John, I’d never considered that one. Isn’t the Slovak preference not to have this form of haček look like a comma or apostrophe? You wouldn’t use U+0315 for actual decomposition, would you? Or am I taking the Unicode name too literally?

    Too literally, I think. As far as I know, U+0315 was encoded specifically for compatibility with older encodings that used a postscript mark, following typewriter practice, to write the Slovak Ľ ď ľ and ť. Those diacritics do not decompose to U+0315, of course: they decompose to the combining caron U+030C; that's convenient for case mapping, but is a headache if one wants to display any of those letters with an actual caron mark.
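
    A small sketch of checking those decompositions, again assuming Python's unicodedata module:

        import unicodedata

        print(unicodedata.name("\u0315"))   # COMBINING COMMA ABOVE RIGHT
        for ch in "Ľ ľ ď ť".split():
            print(ch, unicodedata.decomposition(ch))
        # Ľ -> '004C 030C', ľ -> '006C 030C', ď -> '0064 030C', ť -> '0074 030C'
        # i.e. all of them decompose to the combining caron U+030C, never to U+0315.
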
  • John Hudson
    John Hudson Posts: 3,206
    Thirty years ago it must have been very very difficult to become language-savvy in a big way.

    Even twenty years ago. The first version of our website tried to provide reliable information on alphabets of European languages, c.1995, and even that was quite difficult, with not a lot of material in English on eastern European alphabets. It was also a time of some uncertainty, with Turkic languages of the former Soviet Union adopting new Latin alphabets, and nationalist politics and violence in the Balkans splitting Serbo-Croat into three separate languages.

    In 1997–98, I worked on a language and script research project for Microsoft, which involved a lot of time in university libraries photocopying books and taking notes. Today, that kind of project would be much easier, and a lot of the work could be done from my desk using Web resources, at least for major regional languages.
  • John Hudson
    John Hudson Posts: 3,206
    Talk of the WRIT project research reminded me of this presentation, which I gave during the pre-conference tech sessions at TypeCon 2008. No commentary, just the slides:

    African Alphabets (2008)
  • Kent Lew
    Kent Lew Posts: 944
    As far as I know, U+0315 was encoded specifically for compatibility with older encodings that used a postscript mark, following typewriter practice, to write the Slovak Ľ ď ľ and ť.
    Good to know that this codepoint is specifically related to those characters. It would have been nice to have that kind of information noted in the Unicode chart, or at least in the Unicode Standard. The alternate graphical forms of ď ľ et al. are mentioned several times, but there is no reference to U+0315 serving this legacy purpose (or any other).

    BTW, I was very sorry that the WRIT project never reached fruition.
  • The most interesting links to search were already posted here and in other threads, but I would like to add some others. The first one, from Unicode, is quite important, although the overwhelming number of items makes navigation slow. The last one has a list of scripts with their proposals to Unicode. These proposals are often the best documentation one can find regarding ancient alphabets, lesser-known languages or rare characters.* (A small sketch of pulling a language's exemplar characters from the CLDR data follows at the end of this post.)

    Unicode CLDR Locales

    Letter Database

    Uniview

    SIL Resources

    Unicode Tools

    Unicode: Blöcke

    Script Encoding Initiative

    As for Underware's infographics: they are impressive, but some data are wrong, or at least dubious, as Frode already related. I did not try exhaustively, but the usage information for Æ and ĸ is wrong.

    * My order of confidence when learning about scripts, languages and unusual characters is: 1. to see if John Hudson has posted something about it; 2. to find the proposal presented to Unicode; 3. to get information from a specialized site; 4. to see what Wikipedia says.
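
    Regarding the CLDR link above, a rough sketch of pulling a language's exemplar characters straight from the CLDR data; the repository URL and file layout are assumptions about the current unicode-org/cldr sources, so adjust as needed:

        import urllib.request
        import xml.etree.ElementTree as ET

        # Assumed location of the CLDR locale data for Polish ("pl"); treat the
        # URL as an example only, since the repository layout may change.
        URL = "https://raw.githubusercontent.com/unicode-org/cldr/main/common/main/pl.xml"

        with urllib.request.urlopen(URL) as response:
            tree = ET.parse(response)

        # <characters> holds one or more <exemplarCharacters> elements: the main
        # set, plus auxiliary/index/numbers/punctuation sets marked by "type".
        for elem in tree.iter("exemplarCharacters"):
            print(elem.get("type") or "main", elem.text)
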

  • This thread started out a little lame but it picked up steam. Thanks to all. The topic rates the effort.

    Here is a real world problem I ran into. It has to do with Gujarati.

    Google web fonts has at least one Gujarati extension in the works for an existing typeface and I wanted to make a test page to make sure the font has all the characters needed for a writer/speaker of Gujarati to communicate. 

    Now, in Glyphs, when I click on Gujarati, here is what I see:



    Now, unless I'm missing something, this means the current empty test font I've got loaded in Glyphs is missing everything Gujarati.  Zero vowels, consonants, etc...

    And the total of - is it characters or glyphs or a combination of both? - is 16+56+33+56+14+12 = 187. 187 "slots" in the font needed for Gujarati, I am presuming.  (And that is if I am presuming right and Glyphs is right, both.)

    Now, Unicode's Gujarati page lists 85 characters. So what's up with the 102-character difference?

    Now, I was working with a .nam file for Gujarati in Google Web Fonts' GitHub repo that had Unicode points but no Unicode names. I added the names and found that some Devanagari characters are included (this makes sense, right?) and, of course, the rupee. This brought the total up to, I think, around 101 characters. Still far short of what Glyphs is showing. (A small counting sketch follows at the end of this post.)

    What's goin' on?  Clues?
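
    One way to sanity-check those numbers is to let Python's unicodedata module do the counting. A rough sketch; the .nam line format shown is an assumption, and the assigned-character count depends on the Unicode version your Python build ships:

        import unicodedata

        # Count the assigned characters in the Gujarati block (U+0A80 to U+0AFF).
        assigned = [cp for cp in range(0x0A80, 0x0B00)
                    if unicodedata.name(chr(cp), None) is not None]
        print(len(assigned))   # 85 in Unicode 8.0; later versions add a few more

        # Annotate a .nam-style list of codepoints with their Unicode names
        # (assuming each line starts with a hex codepoint such as "0x0A95").
        for line in ["0x0A95", "0x0ABE", "0x20B9"]:
            cp = int(line.split()[0], 16)
            print(line, unicodedata.name(chr(cp), "<unassigned>"))
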

  • John, I (and surely many others) really appreciate your comments. They are highly informative, objective, polite, generous and comprehensive. The very rare mistakes do not alter this, especially because you are always keen to explain corrections and add new discoveries.

    Of course, many fellows also contribute a lot. I learnt very much from Denis, Stötzner, Twardoch, Phinney, Frode, Hrant, Ellertson, Karsten, Khaled, and Zhukov, to name a few.
  • George Thomas
    George Thomas Posts: 645
    edited February 2016
    @Richard Fink Do the glyphs in the font use the Glyphsapp naming convention? If not, they will appear to be missing.
  • @George Thomas  I think we're talking apples and oranges.  The glyphs in the font are definitely not there. Purposely.  I'm turning to Glyphs as a tool to provide guidance as to what glyphs need to be there.  And I'm trying to find out why there are such big discrepancies between my sources of information. 
  • kupfers
    kupfers Posts: 259
    Did those of you who found incorrect things in the Underware info send them a note and suggestions? I’m sure they would appreciate it.
  • George Thomas
    George Thomas Posts: 645
    edited February 2016
    @Richard Fink  You're right. I was assuming you had opened a font in Glyphs that was created elsewhere.
  • John Hudson
    John Hudson Posts: 3,206
    Now, Unicode's Gujarati page lists 85 characters. So what's up with the 102-character difference?

    I presume Glyphs is reporting the number of glyphs it considers to be necessary for Gujarati support (based on its internal automation data, which might or might not be equivalent to what a designer considers necessary for a given design), not the number of characters needed. So, for example, none of the 56 conjuncts reported would be encoded characters: they're ligatures.
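
    For instance, the common conjunct ક્ષ is not an encoded character at all: it is a three-codepoint sequence that the font renders as a single ligature glyph. A quick illustration, assuming Python's unicodedata module:

        import unicodedata

        # The conjunct "kssa" is typed as KA + VIRAMA + SSA; the ligature itself
        # has no codepoint of its own and exists only as a glyph in the font.
        for ch in "\u0A95\u0ACD\u0AB7":
            print(f"U+{ord(ch):04X}", unicodedata.name(ch))
        # U+0A95 GUJARATI LETTER KA
        # U+0ACD GUJARATI SIGN VIRAMA
        # U+0AB7 GUJARATI LETTER SSA
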
  • Now, Unicode's Gujarati page lists 85 characters. So what's up with the 102-character difference?

    I presume Glyphs is reporting the number of glyphs it considers to be necessary for Gujarati support (based on its internal automation data, which might or might not be equivalent to what a designer considers necessary for a given design), not the number of characters needed. So, for example, none of the 56 conjuncts reported would be encoded characters: they're ligatures.
    Thanks for the clue. Hmmm....  If they don't have encodings, where do I find them?  Just thinking out loud. I've visited a lot of sites that go into Gujarati in some depth, but what glyphs are needed isn't clear cut.
  • kupfers
    kupfers Posts: 259
    Mh, this is getting the thread off topic, but I’m not sure it’s better to make “something w…” (sorry, can’t bring myself to use that word) yourself instead of also sending the Underware people a brief note about the things you discovered are wrong. It sounds a bit arrogant to me to complain like this about it. I think it’s the first time this kind of comprehensive overview has been made easily accessible in one place, so that not everyone has to start from scratch. Other foundries just keep this to themselves.
  • I agree with your suggestion, Indra, and sent Underware a message regarding Latin Plus. As the site is used as a reference and even offers a validation tool, it would be good if they improved it. A summary:

    1. Some items show the number of languages and users based on any usage of that character. The same criterion is used to indicate “required language-specific characters”. This is wrong for English, as æ and ç are not required, and likewise for Italian, as á and ï are also not required. The information about Portuguese and Spanish is correct.

    2. For Greenlandic, Latin Plus says that ĸ is used by the whole population of Greenland. As it is a deprecated character from the pre-1973 orthography, this is not correct. The site should indicate when a character is required only for historical support.

    3. Quechua is reported as a younger brother of English, but it is a native South American language with no relation to English. The same goes for Indonesian and Albanian, which are also shown as brothers of English.

  • 3. Quechua is reported as a younger brother of English, but it is a native South American language with no relation to English. The same goes for Indonesian and Albanian, which are also shown as brothers of English.
    The ‘brother’ relationship logic in Latin Plus is something that either Frode or I brought up with Underware late last year. I still don’t fully understand the point of it, but it just makes a sort of ‘least differences’ connection. It’s not very useful for understanding language differences, at any rate.
  • Kent Lew
    Kent Lew Posts: 944
    I've visited a lot of sites that go into Gujarati in some depth, but what glyphs are needed isn't clear cut.
    Rich — You are going to run into that problem with most all of the Indic languages, and others as well. There is not a one-to-one relationship between the characters required to encode the language and the glyphs required or desirable to represent the language typographically.

    The issue of conjuncts is not a sharply contained one. There may be grey areas regarding which are “required.” And there will be different design approaches to supporting them (pre-composed ligatures vs modular components). Thus a definitive list of glyphs may be elusive.

    Because they are unencoded and rather the result of codepoint combinations and interactions, you may find it tricky to set up a one-size-fits-all investigation.
  • Now, unless I'm missing something, this means the current empty test font I've got loaded in Glyphs is missing everything Gujarati.  Zero vowels, consonants, etc...

    The font you opened probably doesn’t use the same glyph names as Glyphs. 

    Now, Unicode's Gujarati page lists 85 characters. So what's up with the 102-character difference?

    Indian languages tend to lack standard spelling and pronunciation, in part because the languages we foreigners read about tend to encompass lots of minority languages. And whether one of those minority languages is an accent, a dialect, or a language unto itself is often not a settled matter. There’s historical, cultural, and political baggage attached to this, so you’re not going to get one answer about things involving Indic writing systems.
  • John Hudson
    John Hudson Posts: 3,206
    BTW, my nomination for the most unnecessary characters in Unicode goes to U+0953 and U+0954. Sure, the kind of thing Ray notes — historical characters no longer in use, Uralicist phonetic notation letters, diacritics for transliterating archaic Persian, etc. — may reasonably be excluded from large numbers of fonts, but at least specialists somewhere in the world have a need for them, ergo Brill. U+0953 and U+0954 are simply an error: they should never have been encoded in the first place.
  • When I read "the most unnecessary character in Unicode" I thought you would point to U+203D and U+2E18. The inverted interrobang proves that inverting an error does not produce a success.
  • Ray Larabie
    Ray Larabie Posts: 1,432
    If anyone feels like this has gone off topic, there's a general Unicode thread here.
  • Kent Lew said:
    I've visited a lot of sites that go into Gujarati in some depth, but what glyphs are needed isn't clear cut.
    Rich — You are going to run into that problem with most all of the Indic languages, and others as well. There is not a one-to-one relationship between the characters required to encode the language and the glyphs required or desirable to represent the language typographically.

    The issue of conjuncts is not a sharply contained one. There may be grey areas regarding which are “required.” And there will be different design approaches to supporting them (pre-composed ligatures vs modular components). Thus a definitive list of glyphs may be elusive.

    Because they are unencoded and rather the result of codepoint combinations and interactions, you may find it tricky to set up a one-size-fits-all investigation.
    Tricky indeed. At least for Gujarati. That said, I have been told by Frank Blokland that the language analysis feature in OTMaster is totally trustworthy.
    I hope Frank is right because I do rely on it a lot.
    And there are Indic languages listed.  Just not Gujarati. 
    But hey, just knowing which orthographies are more slippery than others helps a lot. 
    BTW - the MS Typography site has a tutorial on Gujarati that I printed out as a reference but haven't read as yet. Gotta check it out.
  • John Hudson
    John Hudson Posts: 3,206
    edited February 2016
    Rich: http://www.unicode.org/versions/Unicode8.0.0/ch12.pdf

    Read the entire Devanagari section before reading the Gujarati section, as it provides the archetype for Indic script encoding and processing.

    The table of Gujarati conjuncts is representative, rather than exhaustive.
  • Rich: http://www.unicode.org/versions/Unicode8.0.0/ch12.pdf

    Read the entire Devanagari section before reading the Gujarati section, as it provides the archetype for Indic script encoding and processing.

    The table of Gujarati conjuncts is representative, rather than exhaustive.
    Thank you. I will.   I've already gotten a peek at the scope of the conjuncts.  There's a lot. Tough character/glyph set to nail down.