Full character set of Latin script

Mithil Mogare · September 2023

Hello,
I am developing a Latin script typeface. For that I need a full glyph set or character set of the letters in the Latin script. I am getting confused while adding glyphs in FL7 because it is mixed.
I accidentally added some characters based on Latin letters, like bdot, mdot, ndot and so on... Later I learned that some of them are not Latin characters.
Where can I get all the accented letters and supporting characters of the Latin script?
Also, which font should I refer to for a Latin glyph set?
Has somebody published a mandatory character set that covers all essential glyphs in the Latin script?
Thank you.

Simon Cozens · September 2023

There isn't really any such thing as "the" "full" character set of the Latin script; it depends what languages you are targeting.

If you want to cover everything that Unicode considers "Latin", then you will need to fill all codepoints 0000..024F, 1E00..1EFF, 2C60..2C7F, A720..A7FF, AB30..AB6F, 10780..107BF, 1DF00..1DFFF. That's 1484 codepoints (some of which are not allocated). But that doesn't include some of the IPA extensions which are needed for writing some African languages in Latin script, so you would want them as well. And some of those characters are only interesting to people studying medieval manuscripts...
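
As a rough illustration of what those ranges amount to, here is a minimal Python sketch. The range list is copied from the post; the function names are only illustrative. Summing the inclusive ranges this way gives 1488 slots, a little different from the figure quoted above, so the exact total depends on how the ranges are counted.

```python
# Latin-related Unicode ranges as listed in the post (inclusive, hexadecimal).
LATIN_RANGES = [
    (0x0000, 0x024F),
    (0x1E00, 0x1EFF),
    (0x2C60, 0x2C7F),
    (0xA720, 0xA7FF),
    (0xAB30, 0xAB6F),
    (0x10780, 0x107BF),
    (0x1DF00, 0x1DFFF),
]


def total_codepoints(ranges):
    """Count every slot in the inclusive ranges, allocated or not."""
    return sum(end - start + 1 for start, end in ranges)


def in_ranges(char, ranges=LATIN_RANGES):
    """Return True if a single character falls inside one of the ranges."""
    codepoint = ord(char)
    return any(start <= codepoint <= end for start, end in ranges)


def outside_ranges(text, ranges=LATIN_RANGES):
    """List the distinct characters of a text that the ranges do not cover."""
    return sorted({char for char in text if not in_ranges(char, ranges)})


print(total_codepoints(LATIN_RANGES))
# U+0253 (b with hook) sits in the IPA Extensions block, outside the list.
print(outside_ranges("a\u0253\u1e27"))
```

A check like `outside_ranges` on sample text in a target language is one quick way to see which needed characters, such as the IPA extensions, fall outside the list.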

In reality, people tend to take a more stratified approach, with different "levels" of compliance. For example, the Google Fonts glyphsets have a "kernel" glyphset of basic Latin, then a "core" glyphset which is what we expect for most languages using the Latin script, then "Vietnamese" and "African", as well as "plus" and "beyond" for completist coverage.

GF Latin "core" is probably what you want as a standard Latin set.

Denis Moyogo Jacquerye · September 2023

There are hundreds of Latin characters which you may or may not want depending on what languages or writing systems you want to cover.

If you want basic English, GF Latin Kernel is good enough. If you want some other major languages or those with official status in Europe and North America using the Latin alphabet, GF Latin Core is good.

That seems to match your assumption that bdot, etc. are not what you expected to have to cover.

Michael Rafailyk · September 2023

https://www.alphabet-type.com/tools/charset-checker

https://hyperglot.rosettatype.com

https://en.wikipedia.org/wiki/List_of_precomposed_Latin_characters_in_Unicode

Igor Freiberger · September 2023

My base Latin set includes 1,034 characters. With it, one can write all languages that use the Latin script except the indigenous languages of Northwest Canada, which demand additional diacritics and mark-to-mark definitions.

I attached a txt file with the complete list to be used in FontLab.

Image: https://us.v-cdn.net/5019405/uploads/editor/xa/e7quex3xcb5f.jpg

Image: https://us.v-cdn.net/5019405/uploads/editor/ka/69fmjvju8jkd.jpg

Notes:

The set also includes characters from old orthographies used during the 20th century, such as Greenlandic kra (removed in 1976) or Latvian G with slash bar (removed during the early Soviet era).
Older characters, such as the blocks of ancient characters from the Roman Empire and medieval characters and symbols, are not in the set.
It's always possible to go beyond. For example: the subset of phonetic-only symbols is not here and adds 388 characters. There are also dozens of other diacritics needed for full phonetic notation.
Additional subsets you may consider are linguistic variations, regional variations, and historic design variations.
The subsets of diacritics, punctuation, currencies, and letter-like symbols are essential.

Jens Kutilek · September 2023

To add to the character set definitions already posted:

Christoph Koeberlin proposed three levels of Latin support, S, M, and L, based on language support: https://github.com/koeberlin/Latin-Character-Sets

He also compiled and extended some information about the design of certain letters in his Latin S: https://github.com/koeberlin/Designing-Latin-S

Simon Cozens · September 2023

Jens Kutilek said:

Christoph Koeberlin... also compiled and extended some information about the design of certain letters in his Latin S: https://github.com/koeberlin/Designing-Latin-S

I know this stuff is hard and type designers are trying to be helpful to one another, and so I genuinely don't mean to be snarky but... throwing out these lists and (particularly) images of glyphs unsourced is, in my view, irresponsible. It encourages people to copy and fill in without really taking the time to understand what's going on. For example, the page you mention has:

Image: https://us.v-cdn.net/5019405/uploads/editor/ne/ntnmwk4cdjlf.png

Well, that's clear and helpful. Anyone following that would centre the dieresis over the width of the h.

But look above at Igor's table:

Image: https://us.v-cdn.net/5019405/uploads/editor/o8/iuetl88jjscq.png

They can't both be right - or can they? Who should I trust? Does the positioning actually matter? Maybe I can just make up whatever works for my design.

More important questions are: Why is there a ḧ and what is it for, and what do users of those letters expect it to look like? Now it turns out the answers are "It's used to transcribe Arabic hah for Northern Kurdish" and "I don't know, I would have to find out before I designed it". But you would never come to that conclusion from a table of confident-looking exemplars.

This is slightly belabouring the point but you've got me on the subject now: the problem I have with Unicode in-fill-ism is that it tends to see characters as a playground for the type designer's self-expression, when in fact characters exist as a convention between language users to enable their self-expression. Once you lose sight of the language user community, you should close your font editor and do something else instead.

Igor Freiberger · September 2023

Simon,

1.
I am surprised by your characterization of my post as "irresponsible". This is a bit too harsh, don't you think? Yes, I am sure this was not your intention, but it is what you wrote. You could, for example, have asked whether the table has sources before calling it this.

2.
The table I posted came from years of continuous research on orthographies. I can assure you that everything there has trustworthy sources. And the main source I would add regarding ḧ is simply Unicode.

But I agree one shouldn't add everything from Unicode without filtering it. So, I collected additional info (in 2011-2012): ḧ is used in Cowichan, one of the many languages from the Vancouver area (source: First Nations pages, French Wikipedia, Huronia typeface).

It's also used to transliterate Arabic into Northern Kurdish. Actually, ḧ represents a sound which is not used in Kurdish, so it only appears in loan words from Arabic (source: Omniglot, German Wikipedia, and "Romanization of Kurdish", a UK Government document from 2007).

3.
You started a different discussion. The thread is about what is needed to support languages using the Latin script, not the design choices for each character. As I noted, design, linguistic, and cultural variations are not included in the table.

Anyway, as far as I could research, there is no clear "right" position for the dieresis above h in Kurdish. Until now, I have had no feedback from locals about the preferred way. This issue was the subject of a discussion on Typophile in 2011 and also of private talks with some designers.

For Cowichan, on the other hand, there is a strong indication towards the dieresis above the h ascender: this is how Ross Mills designed this character in Huronia, a font specially focused on languages from the Canadian West Coast.

John Butler · September 2023

I was going to suggest looking at decodeunicode, but that seems to have stopped at Unicode 11.0.0 in 2018.

Denis Moyogo Jacquerye · September 2023

For what it's worth, the reference indicating that h with dieresis is used in Cowichan may be outdated, or may need some nuance: the relevant alphabet page [1] on First Voices and the language resource page of the school district [2] show it isn't used, at least not anymore.

Sometimes orthographies are not static and change or are replaced over time.

[1] https://www.firstvoices.com/explore/FV/sections/Data/Coast Salish/Halkomelem/HUL'Q'UMI'NUM'/learn/alphabet

[2] http://ined.sd79.bc.ca/hulqumimum-resourses/

jeremy tribby · September 2023

Simon Cozens said:

throwing out these lists and (particularly) images of glyphs unsourced is, in my view, irresponsible. It encourages people to copy and fill in without really taking the time to understand what's going on.

I don’t think it’s irresponsible and I don’t think it encourages anything irresponsible. If a type designer doesn’t get feedback from experts on scripts they are unfamiliar with, I think it’s a stretch to say the incomplete information in a GitHub repo somewhere is to blame.

Yves Michel · September 2023

Michael Rafailyk said:

https://www.alphabet-type.com/tools/charset-checker

I'm using Alphabet Type, like Michael. There you can build the set you need by combining the languages you want to cover. And you're able to export the list in different ways.

I like Igor's character set, even if I don't use all the glyphs he includes.
Using these 2 references, you should be ready.

I wish you success!

Igor Freiberger · September 2023

The site Alphabet Type does an amazing job helping type designers. But it's always necessary to cross-check data with other sources. I believe the people who maintain Alphabet Type use CLDR and locales to build its database, because the problems it presents are the same ones I saw in these sources.

This is for my native Portuguese:

Image: https://us.v-cdn.net/5019405/uploads/editor/x6/kqz8i2mrf7as.jpg

Two of the necessary characters are wrong, and only two of the auxiliary ones are necessary. All the others are used only for words from other languages and, if you consider these necessary, then all the Latin blocks are necessary. The punctuation is missing $, {, and }.

Andreas Stötzner · September 2023

Igor Freiberger said:

The site Alphabet Type does an amazing job …

… Two necessary characters are wrong and only two auxiliary are necessary. … The punctuation misses $, {, and }.

… and it is wrong to count # @ & % $ ¶ under “punctuation”. That alone reveals lousy work.

I would never trust any other source than my own research results.

Yves Michel · September 2023

Andreas Stötzner said:

That alone reveals lousy work.

I would never trust any other source than my own research results.

"Lousy work"! What an offensive qualifier!
For the author of this site and for us fools using it!

At least, this site exists! And it helps ignorant people like us.
But maybe you could share the results of your own research.

In fact, that's what the author of this thread asked for. And that's what Igor answered.

Andreas Stötzner · September 2023

Yves Michel said:
…

What an offensive qualifier!

Not at all. But a legitimate expression of a personal opinion. The Alphabet Type record Igor showcases above doesn’t even reflect that the Portuguese eventually use the € or % characters. I’d say ‘lousy’ is a very cosy term for this sort of “knowledge”. – Sorry for those who can’t cope with criticism.

But maybe you could share the result of your own researches. …

A worthwhile thought. However, I decide entirely for myself, what I share and with whom. That is legitimate, too.

If you get something for free and then there’s something imperfect about it, you are surprised?!

My remark about own research was to point at the importance that one ought to trust one’s own work, insights and judgement, rather than just asking for “instant knowledge for free” and believing that’s the end of the game. If I can do own research of the kind, everyone else can do it as well.

What Igor Freiberger contributed to this discussion highlights the relevance of own research and critical review; and also the pitfalls of “just trusting someone”:

“it’s always necessary to cross-check data with other sources” – very true.

Yves Michel · September 2023

@Mr Stötzner

Sorry, I was a teacher, delivering "instant knowledge for free". It's not the end of a game, it's called education, transmission of knowledge.

Denis Moyogo Jacquerye · September 2023

@Igor Freiberger My guess is that the CLDR, and resources using it, include ò in Portuguese because it is still in some dictionaries like https://dicionario.priberam.org/cò or https://dicionario.priberam.org/prò. But considering these words are not standard spelling anymore, and are rare or peculiar, it may make more sense to have it in the auxiliary. But then the auxiliary is very unclear: it looks like characters for some European languages, but doesn’t include characters for languages of Portuguese-speaking countries 🤷🏾‍♂️.

In any case, this likely reflects the opinion of the compilers of that character list. A list of characters needed for one language will depend on how you define its requirements, unfortunately that’s just not stated anywhere in those references. Like @Simon Cozens implies, it would be useful to have a rationale, some context and samples for each. Even then, some can validly disagree on what those lists should or should not include, or how their glyphs should look.

Igor Freiberger · September 2023

@Denis Moyogo Jacquerye

Regarding what you saw in Priberam, cò is archaic and prò uses an orthography valid in Portugal (but not in Brazil) from 1945 until the 1990 Orthographic Agreement. There are also a few other words using ò under these old rules. And I agree: Ò ò could be added as auxiliary characters. Just like Ü ü, which were used until the 1990 OA.

The absence of characters from other languages where Portuguese is used is probably a simple criterion since they are not actually needed for Portuguese. But we can also see here a colonial side effect. Portugal and Brazil speak only Portuguese; the number of speakers of other languages is extremely low. But in Africa this is different. The problem, as you know, is that things related to Africa are always undervalued and the negative colonial heritage is still very present. In Angola, for example, Portuguese is the L1 for only 40% of the population. Bantu languages are widely used, but no support for them is even considered when people discuss Portuguese usage and support.

About Alphabet Type and other sites, I regard all of them as valuable but none is complete or error-free (and so I agree with both @Yves Michel and @Andreas Stötzner!). I applaud the work these people do trying to help designers, but can't help criticizing what is wrong. And it's hard to draw the line defining what is needed to support a language and what is not. There seems to be no standard about this.

So I propose a simple schema:

Support for a language: all that is needed to write the language, limited to its own words. Characters to support loan words are included only if this is an official definition (like K k W w Y y, officially part of the Portuguese alphabet).

Secondary characters: the ones needed to write in old orthographies of a given language (like È è Ò ò Ü ü in Portuguese).

Additional characters: what is needed but is neither alphabetic nor part of an abugida, like &, ª, º, §, @, punctuation, etc. The universal set of numbers and number-related characters goes here.

Anything beyond this is support for international nouns, like São Paulo or Schrödinger: nice, but not really needed. Otherwise we will end up saying that Ḝ is needed for English and Ǻ for French!

A possible addition to this small list is geographical support, which would include characters from other languages also spoken in a given region. But this is a different issue.
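
To make the schema above a bit more concrete, here is a minimal Python sketch of the three tiers as a data structure. The class and field names are my own invention, and the Portuguese entry holds only the example characters mentioned in this post, so it is a partial sample and not a real character list.

```python
import unicodedata
from dataclasses import dataclass, field


def chars(text):
    """Split a string into characters, normalized so each accented letter is one codepoint."""
    return set(unicodedata.normalize("NFC", text))


@dataclass
class LanguageCharset:
    """Hypothetical container for the three tiers proposed in the post."""
    name: str
    support: set = field(default_factory=set)     # needed to write the language itself
    secondary: set = field(default_factory=set)   # needed for old orthographies
    additional: set = field(default_factory=set)  # non-alphabetic: symbols, punctuation, numbers

    def required(self, include_secondary=False, include_additional=True):
        """Return the union of the tiers a designer chooses to cover."""
        result = set(self.support)
        if include_secondary:
            result |= self.secondary
        if include_additional:
            result |= self.additional
        return result


# Partial sample only: just the characters named as examples in the post.
portuguese_sample = LanguageCharset(
    name="Portuguese (partial sample)",
    support=chars("KkWwYy"),
    secondary=chars("ÈèÒòÜü"),
    additional=chars("&ªº§@"),
)

print(sorted(portuguese_sample.required()))
print(sorted(portuguese_sample.required(include_secondary=True)))
```

Keeping the tiers separate means the decision about old orthographies stays explicit instead of being buried in one flat list.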

Now it's time to be quiet. I already wrote too much and surely there are plenty of people more skilled than me on these themes. 😉

Yves Michel · September 2023

Igor Freiberger said:

... surely there are plenty of people more skilled than me on these themes. 😉

I doubt it!

Yves Michel · September 2023

@Mithil Mogare
As you were the one asking the question 5 days ago, could you tell us if the numerous answers were of any help? And whether you appreciated them?
It's so nice to have feedback.

Mithil Mogare · September 2023

I got diverse information, and it opened up numerous ways of approaching the problem.
Thank you everyone.

John Savard · September 2023

This discussion was certainly interesting. Ideally, one would like to have a reference source that covers all the languages which use the Latin alphabet, showing the characters used by each, and the history of previous versions of their orthography.

I believe that such references do exist, but the fact that it's challenging to keep them up-to-date adds another wrinkle to finding a reliable one.

And, as a native English speaker, I find it just weird that the orthography of a major language could change within living memory. After 1800 or so, I naturally expect all orthographies, except those of indigenous languages only recently reduced to writing, to have been frozen and ossified!

Michael Wallner · September 2023

I've also been trying to figure out a character set to use, and am as frustrated and confused as you sound.

One of the things I've done is to look at the character sets developed by foundries I like and respect. Make sure to check their newer fonts, because not all their older ones have been updated to their current character set.

Underware developed their own character set they claim covers 200 languages. You can read through their process of how they developed it, and how and why they made some of their decisions.

https://underware.nl/latin_plus/

Unfortunately, I think you will drive yourself completely mad trying to cover every Latin-based language, and maybe never finish your font. It might be better to start with a smaller character set you feel very comfortable and confident with, so you can finish and release it. Then continue to expand your knowledge of languages and add and grow your character set as you go forward. You can always update your font at a later date.

I hope this helps.

Thomas Phinney · September 2023

If one is interested in smaller Latin character sets, both Adobe and Google have published their own sets of nested Latin character sets, each one building off the previous:

Google’s Latin glyph sets (https://github.com/googlefonts/glyphsets/tree/main/GF_glyphsets/Latin) are sometimes stacked and sometimes not. That is, each set does NOT always repeat the characters of the previous set; previous sets may be needed.

Google Kernel: supports ASCII plus the punctuation and symbols necessary for English.
Google Latin Core: supports Latin alphabets for European and American languages with more than 5M speakers (incl. Kernel). Similar to Adobe Latin 3.
Google Latin Vietnamese: additional support for Vietnamese.
Google Latin Plus: additional set of symbols for basic math and economy. This includes the above 3 sets.
Google Latin African: +413 characters to support African languages written in Latin script that are not supported by Latin Core.
Google Latin Beyond: +134 characters. Adds support for indigenous Latin-based languages from European and American regions (< 5M speakers) that are not supported in Latin Core.

Adobe’s Latin glyph sets
(https://adobe-type-tools.github.io/adobe-latin-charsets/adobe-latin-1.html) always include the characters from previous levels.

Adobe Latin 1, 229 glyphs, the classic ISO-Adobe set used in PostScript Type 1 fonts
Adobe Latin 2, 250 glyphs, adds the “symbol substitution” characters: Euro, litre, and estimated. The set used for Adobe “Std” fonts converted from PostScript.
Adobe Latin 3, 331 glyphs. Also known as Adobe CE. Adds characters needed for Central/Eastern European languages.
Adobe Latin 4, 619 glyphs (617 precomposed + 2 combined) includes Vietnamese
Adobe Latin 5, 1746 glyphs (690 precomposed + 437 combined) includes Latin African and more
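
The practical difference between "stacked" sets and sets that always include the previous levels can be sketched in Python. Everything below is a hypothetical stand-in: the tiny character samples, the `requires` field, and the `resolve` function are assumptions for illustration, not the actual data format of either repository. Only the dependencies stated in this post (Core includes Kernel; Plus includes the three sets before it) are encoded.

```python
# Hypothetical, tiny stand-ins for real glyph set definitions.
GLYPHSETS = {
    "Kernel": {"chars": {"A", "a"}, "requires": []},
    "Core": {"chars": {"\u00e9"}, "requires": ["Kernel"]},
    # Dependencies of the Vietnamese set are not stated in the post; left empty here.
    "Vietnamese": {"chars": {"\u1ea1"}, "requires": []},
    "Plus": {"chars": {"\u2248"}, "requires": ["Kernel", "Core", "Vietnamese"]},
}


def resolve(name, glyphsets, _seen=None):
    """Return the full character set for `name`, pulling in the sets it builds on."""
    seen = set() if _seen is None else _seen
    if name in seen:
        return set()  # already merged higher up the call chain
    seen.add(name)
    entry = glyphsets[name]
    result = set(entry["chars"])
    for dependency in entry["requires"]:
        result |= resolve(dependency, glyphsets, seen)
    return result


print(sorted(resolve("Core", GLYPHSETS)))
print(sorted(resolve("Plus", GLYPHSETS)))
```

With nested sets like Adobe's, the resolving step is unnecessary because each level already lists everything below it; with stacked sets, forgetting it leaves gaps in the font.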

Craig Eliason · September 2023

Looks like Vanilla doesn't handle URLs set within parentheses very smartly (at least with my browser). If you're trying to follow Tom's links above and they don't work, make sure to delete the trailing parenthesis.
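
If you are collecting these links with a script, a small Python sketch like the one below can undo the problem. The function name is made up, and the second URL is only an illustrative example; the idea is simply to drop closing parentheses that have no matching opening one.

```python
def strip_unbalanced_trailing_paren(url):
    """Remove trailing ')' characters that have no matching '(' in the URL."""
    while url.endswith(")") and url.count(")") > url.count("("):
        url = url[:-1]
    return url


# The auto-linked form of one of the links above, with the stray parenthesis.
print(strip_unbalanced_trailing_paren(
    "https://github.com/googlefonts/glyphsets/tree/main/GF_glyphsets/Latin)"
))
# A URL whose parentheses are balanced is left alone.
print(strip_unbalanced_trailing_paren("https://example.org/wiki/Latin_(script)"))
```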

Thomas Phinney · September 2023

Yes, I was noticing that as well, can somebody with admin powers edit my post above? (And then optionally delete this post?)

Both my links have the closing parenthesis included in the auto-URL, which makes the links invalid.

Thanks!

Mithil Mogare · September 2023

Yves Michel said:

It's so nice to have feedback.

I got in-depth insights. It is very helpful.
