What character set do you usually reach for as a default when you start a new font?

24

Comments

  • Richard Fink
    Richard Fink Posts: 165
    edited February 2016
    Came upon this link at Linotype's site:

    OpenType Character Sets – OpenType Std


    Follow the link trail that starts on that page and you'll see Linotype's charsets and what they label them, plus the progression of language coverage from set to set, and notes on when and why the sets were chosen. Too bad there are no lists like .enc files or .nam files.  But they do give you a sum total of the chars in the set, and a clear image of all the glyphs in the set.  So, if you wanted, you could piece it together. Or ask them for lists, which is what I'm going to do. If I get 'em, I'll post them.

    Also, the link that @Fernando Díaz provided in his comment is worth a bookmark:

    http://underware.nl/latin_plus/

    Underware has obviously done some thorough research. Hover over a character in the chart and you get a popup that gives you details about the character. It even tells you how many languages use that particular character.

    Nice presentation. Very complete.

    John Hudson said:
    BTW, on the subject of how to indicate what scripts/languages a font explicitly supports, in Windows 10 Microsoft has adopted Apple's 'meta' font table with <dlng> and <slng> tags for 'design language(s)'.................................
    Thanks for the heads-up on that. 
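    For anyone who wants to poke at those tags, here is a minimal sketch of how the two values could be compared once they have been read out of a font. It assumes, as I understand the 'meta' table, that each value is a comma-separated list of script/language tags; the sample strings and the function names are made up for illustration and do not come from any real font.

```python
from typing import List


def parse_tag_list(value: str) -> List[str]:
    """Split a comma-separated tag string into a clean list of tags."""
    return [tag.strip() for tag in value.split(",") if tag.strip()]


def supported_but_not_designed(dlng: str, slng: str) -> List[str]:
    """Tags the font claims to support without being designed for them."""
    designed = set(parse_tag_list(dlng))
    return [tag for tag in parse_tag_list(slng) if tag not in designed]


# Hypothetical values, not taken from an actual font.
sample_dlng = "Latn, Grek"
sample_slng = "Latn,Grek,Cyrl"

print(parse_tag_list(sample_dlng))
print(supported_but_not_designed(sample_dlng, sample_slng))
```

    Reading the raw strings out of a font file is left out on purpose; any font tool that exposes the 'meta' table should be able to hand you the two values.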
  • Ray Larabie
    Ray Larabie Posts: 1,438
    The problem with the Unicode charts is that they don't indicate junk glyphs. If you comply with Unicode ranges, you end up including a lot of garbage, wasting time and bloating fonts. I think there should be characters in each range which are officially flagged as optional. That way each range could be calculated as complete, functionally complete or incomplete.
  • John Hudson
    John Hudson Posts: 3,262
    The problem with the Unicode charts is that they don't indicate junk glyphs. If you comply with Unicode ranges, you end up including a lot of garbage, wasting time and bloating fonts. I think there should be characters in each range which are officially flagged as optional.

    Optional to what? It's a character encoding standard.
  • George Thomas
    George Thomas Posts: 649
    edited February 2016
    @Richard -- The Linotype sets are available on a thread in the Glyphsapp Forum, here: http://tinyurl.com/jhuga3m -- It is a plist file with the Linotype encoding inside Item 5>Item 2.
  • Ray Larabie
    Ray Larabie Posts: 1,438
    Optional to what? It's a character encoding standard.
    Let's say Latin Extended A 0100-017F. For that range to be supported, all glyphs are assumed to be present. At least that's been my experience. When a client is adhering to some technical standard and they require Latin Extended A, they don't just mean some of it. For that particular range, there are a few junk glyphs but not too many. But Latin Extended B has loads of never-going-to-be-used trash. Sure, I can try to convince a client that certain glyphs are worthless but it would be better if they could officially be deemed chaffy. As it is now, when a client requires Latin Extended B, I have to include glyph rubbish, including those idiotic ring acutes, knowing full well that they'll never, ever be used.
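    A rough sketch of the "complete, functionally complete or incomplete" idea, assuming someone had published a list of optional code points for the range. No such official list exists, so the `OPTIONAL` set below is purely hypothetical, as is the function name.

```python
def classify_block(font_codepoints, first, last, optional=frozenset()):
    """Classify a font's coverage of the range first..last (inclusive)."""
    block = set(range(first, last + 1))
    missing = block - set(font_codepoints)
    if not missing:
        return "complete"
    if missing <= set(optional):
        # Everything that is missing has been flagged as optional.
        return "functionally complete"
    return "incomplete"


# Latin Extended-A, the range mentioned above.
FIRST, LAST = 0x0100, 0x017F

# Hypothetical "optional" flags, for illustration only.
OPTIONAL = {0x0149, 0x017F}

full = set(range(FIRST, LAST + 1))
print(classify_block(full, FIRST, LAST, OPTIONAL))
print(classify_block(full - {0x0149}, FIRST, LAST, OPTIONAL))
print(classify_block(full - {0x0141}, FIRST, LAST, OPTIONAL))
```

    The three calls cover a full block, a block missing only a flagged character, and a block missing an unflagged one (Lslash).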

  • John Hudson
    John Hudson Posts: 3,262
    Well, you won't hear any argument from me that client procurement requirements are often daft when it comes to character sets, but it's hard to blame them. Most companies don't have script and language experts, or even text processing experts who really understand what Unicode is, or have the resources to spend two years — like Brill did — carefully documenting their needs and planning a major font project. So it's easier for them to simply point to what look, to them, like discrete blocks of Unicode characters and say, 'We need these'. [Of course, the blocks aren't really discrete — sometimes the casing pair for a character might be in a different block.] Some of the blocks — including Latin Extended B — are subdivided into labeled sections, and that can be useful in helping clients understand what they might or might not actually need.
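    The point about casing pairs can be checked with a few lines of Python. This is only a sketch: it relies on Python's built-in case mappings, skips mappings that expand to more than one character (such as ß to SS), and the function name is made up.

```python
def case_partners_outside(first, last):
    """List (code point, partner) pairs whose case partner lies outside first..last."""
    pairs = []
    for cp in range(first, last + 1):
        ch = chr(cp)
        for partner in (ch.upper(), ch.lower()):
            if len(partner) != 1 or partner == ch:
                continue  # no single-character case partner
            if not first <= ord(partner) <= last:
                pairs.append((cp, ord(partner)))
    return pairs


# Latin-1 Supplement: U+0080..U+00FF
for cp, partner in case_partners_outside(0x0080, 0x00FF):
    print(f"U+{cp:04X} pairs with U+{partner:04X}")
```

    For the Latin-1 Supplement this reports, among others, ydieresis (U+00FF), whose uppercase sits in Latin Extended-A at U+0178.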

    If there were secondary documentation that mapped Unicode characters to specific uses, and this documentation were reliable and suitably endorsed — whether as a de facto or de jure standard —, then it would probably be easy to steer clients to this. [Corporate character subsets like WGL4 became de facto standards for some clients; heck, I've even had people say to me, 'We want the Helvetica World set'.] I don't think such a thing belongs in Unicode itself, though, because there has to be a commitment to the essential equality of all characters within that standard. As soon as you start saying that some characters are essential and others are optional, you start penalising lesser-used languages and specific communities.
  • Ray Larabie
    Ray Larabie Posts: 1,438
    As soon as you start saying that some characters are essential and others are optional, you start penalising lesser-used languages and specific communities.
    It's certainly hard to say officially, "your language isn't worth supporting" but designating some characters as historical when they're not used in current language wouldn't be so controversial. And there are some unlikely characters that are superfluous to normal language usage that could be flagged as optional like interrobang or ring acutes without much argument. I don't think such a list will ever be produced but it sure would be nice.

  • Kent Lew
    Kent Lew Posts: 958
    I sympathize with your pain, Ray. But this has pretty much always been the case when working specifically for a client, to one extent or another, right? You often have to dig beneath the original brief to help the client figure out what the real need is.

    And you have to show them that you know what you’re talking about (or at least have resources to call on ;-).

    I have had to do this with clients — find out what languages, specifically, they are wanting to make sure they can support, or what markets specifically they are hoping to push into, now and in the foreseeable future, and help frame up a character set accordingly. And I explain to them the frivolity of the interrobang, for example, and why they might not want to pay me to draw the damn thing.

    If they still want the whole block, then you charge for it, right?

  • Richard Fink
    Richard Fink Posts: 165
    edited February 2016
    @Richard -- The Linotype sets are available on a thread in the Glyphsapp Forum, here: http://tinyurl.com/jhuga3m -- It is a plist file with the Linotype encoding inside Item 5>Item 2.
    Thank you. The list is missing the larger Linotype char sets like W1G. But it's helpful anyway.  I filled out a support contact form with Linotype (Monotype), and heard back from a guy named Jens asking what my name was.  So, we'll see. 

    Addendum: Just looked again at the Linotype support docs. The W1G set is the same as the Windows Glyph List 4.0, and that's been unchanged since 2007 or so and I have that list. That simplifies things. 
  • As soon as you start saying that some characters are essential and others are optional, you start penalising lesser-used languages and specific communities.
    It's certainly hard to say officially, "your language isn't worth supporting" but designating some characters as historical when they're not used in current language wouldn't be so controversial. And there are some unlikely characters that are superfluous to normal language usage that could be flagged as optional like interrobang or ring acutes without much argument. I don't think such a list will ever be produced but it sure would be nice.
    Agree with @John Hudson that subsetting like that doesn't belong in Unicode.  But I'm not so sure such lists don't exist.  @Ray Larabie , are you talking about char sets that are put together with the number of speakers of a given language in mind?  A "needs of the many" approach? An approach that even takes into account the most popular second languages of those language communities?  Is that the kind of thing you're talking about?
  • Chris Lozos
    Chris Lozos Posts: 1,458
    We tend to think in terms of official standards bodies instead of languages.  Perhaps this is just knee-jerk.  Perhaps we need to name and generate lists that cover languages.
  • George Thomas
    George Thomas Posts: 649
    edited February 2016
    @Richard Fink The list does contain W1G; it's Item 3 in the Item 2 subgroup. The only possible drawback to some is that it uses the Glyphsapp naming convention although that shouldn't make a difference. They don't end up in the final font.
  • @Richard Fink The list does contain W1G; it's Item 3 in the Item 2 subgroup. The only possible drawback to some is that it uses the Glyphsapp naming convention although that shouldn't make a difference. They don't end up in the final font.
    Can't be complete. The WGL4 set has 652 chars and the W1G adds a few more.  The box drawing chars and "other" characters - featured in the accompanying image on the Linotype W1G page - are conspicuously missing.  

    Chris Lozos said:
    We tend to think in terms of official standards bodies instead of languages.  Perhaps this is just knee-jerk.  Perhaps we need to name and generate lists that cover languages.
    There is, on the W1G page I refer to above, a pdf you can download that shows - with some specificity, character by character to an extent - what languages the W1G charset covers.  Whether that pdf is totally trustworthy, I don't know.  Yet.
  • All of the Linotype lists are useful as a reference but not much more since the coverage is limited.
  • John Hudson
    John Hudson Posts: 3,262
    The WGL4 set, when originally published in the late 90s, helpfully distinguished 'core' characters, necessary for language support, from optional characters such as the line- and boxdraw characters that are really only relevant for terminal emulator fonts (they were included in WGL4 because Microsoft is one of the companies that actually cares about terminal emulator fonts and how to build and correctly identify them in the system). Unfortunately, this useful feature was removed when the WGL4 set was updated and republished.
  • Ray Larabie
    Ray Larabie Posts: 1,438
    @Richard Fink 
    Glyphs which are used in less popular languages aren't the main issue. There are lots of glyphs, scattered across the Unicode chart which are deprecated, historical or for academic use only. Under academic use, I include characters which are only used for biblical transliteration... there are a lot of those. Which is fine, but it's not necessary for every font. Maximize language coverage/reduce waste.


  • The ISO 10646 also has subsets defined for this purpose:
    • MES-1 and MES-2 (Multilingual European Subset)
    • Modern European Subset
    • Contemporary Lithuanian Letters
    • Basic Japanese
    • Japanese Non Ideographic Extension
    • Common Japanese
    • Multilingual Latin Subset
    These can be useful but they are not exhaustive, for example many languages using the Latin script use characters missing from the Multilingual Latin Subset.
  • Kent Lew
    Kent Lew Posts: 958
    edited February 2016
    The W1G set is the same as the Windows Glyph List 4.0,
    Even if you’re just talking about alphabetic chars and not miscellaneous symbols, as I recall from when I investigated this, there are some minor differences.

    For instance, the W1G set includes the “historic” ѢѣѲѳѴѵ Cyrillic chars for pre–1918 reform spelling. And the W1G also seems to specify the legacy 0x0162/163 Tcedilla chars of questionable value, which I believe the WGL4 does not. (Not entirely sure about this.)

    The W1G also specifies a more complete set of encoded inferior/superior figures and signs.
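    If you have both lists as plain sets of code points, a comparison like this settles the question quickly. This is a sketch with toy data only: the two sets below are placeholders, not the real W1G or WGL4 contents, and the function name is made up.

```python
def only_in_first(first, second):
    """Code points present in `first` but absent from `second`, as U+XXXX strings."""
    return [f"U+{cp:04X}" for cp in sorted(set(first) - set(second))]


# Toy data: a shared core plus a few extras in one list.
set_a = {0x0041, 0x0061, 0x0410, 0x0462, 0x0463}
set_b = {0x0041, 0x0061, 0x0410}

print(only_in_first(set_a, set_b))  # what A adds on top of B
print(only_in_first(set_b, set_a))  # what B has that A lacks
```

    Running it both ways matters, since "includes" is only true if the second result is empty.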


  • @Kent Lew: Tcedilla is used in Gagauz and in some romanization systems. Besides, there are still a lot of existing Romanian texts that use it even if Tcomma should be used instead. Where the W1G is wrong regarding Tcedilla, on http://www.linotype.com/5801/european-ot-character-set-w1g.html, is the shape: if Scedilla has a cedilla, so should Tcedilla. For the historic Cyrillic characters ѢѣѲѳѴѵ, I remember Maxim Zhukov pointing out that even though they stopped being used in Russia in 1918, they are still used by the Russian Orthodox Church and Russian diaspora outside of Russia.
  • Richard Fink
    Richard Fink Posts: 165
    edited February 2016
    Addendum: Just looked again at the Linotype support docs. The W1G set is the same as the Windows Glyph List 4.0, and that's been unchanged since 2007 or so and I have that list. That simplifies things. 
    I should have said "includes" rather than "the same". As Kent points out, there are a few characters' worth of difference. The W1G has a few more than the WGL4. (Shown in bold in the accompanying image on the W1G page.)
  • Richard Fink
    Richard Fink Posts: 165
    edited February 2016
    The WGL4 set, when originally published in the late 90s, helpfully distinguished 'core' characters, necessary for language support, from optional characters such as the line- and boxdraw characters that are really only relevant for terminal emulator fonts (they were included in WGL4 because Microsoft is one of the companies that actually cares about terminal emulator fonts and how to build and correctly identify them in the system). Unfortunately, this useful feature was removed when the WGL4 set was updated and republished.
    I just put together a test page: Windows Glyph List 4 on Github. (First draft, a little ugly, but the charset should be accurate and complete; it was made from the latest iteration of WGL4 published by Microsoft. Notes on the page say where the information came from.)
    I always wondered how and why the box drawing chars got there. So you're saying those were special purpose? No grand plan?  Terminal emulation? Good to know.
    @Richard Fink 
    Glyphs which are used in less popular languages aren't the main issue. There are lots of glyphs, scattered across the Unicode chart which are deprecated, historical or for academic use only. Under academic use, I include characters which are only used for biblical transliteration... there are a lot of those. Which is fine, but it's not necessary for every font. Maximize language coverage/reduce waste.
    It sounds like your character sets have become rather far-ranging. (I'm going to double back and click the link you provided earlier.)
    I don't blame you a bit for wanting to stick with characters that allow modern readers of a language to understand the meaning of what they are reading with nothing more, nothing less. And everything else off in a different category of "special purpose" or "optional" or "historical" or whatever. 
     
    BTW - @Tim Ahrens posted somewhere (not on Typedrawers; I went looking for his post and have come up dry so far) that he's made a close study of this issue, and there was a list of characters that he, at least, considers superfluous. Tim, if you're out there, weigh in.
    The ISO 10646 also has subsets defined for this purpose
    I'm going to give the ISO sets a fresh look. In light of Unicode, they are obsolete. But that doesn't mean they were incorrect. Thanks.

    Frode Bo Helland said:
    @Richard Fink Many of the languages Latin Plus claims to support are missing required characters or listed with wrong orthographies. Many of the sources do not support their conclusions.
    I wish you would provide two examples of where you found the Latin Plus set going wrong.  

    Golly, I feel better now about being confused.
    And I can't believe it was six years ago, but I remember when web fonts first arrived and @Ethan Dunham was putting Font Squirrel together and we would Skype about which direction the site and web fonts, in general, were taking.  At the time, he was still thinking MacRoman as some kind of default set but I managed to dislodge that idea and make sure he focused on language coverage - a prominent and useful feature you'll find on Font Squirrel to this day. Ask the average web developer what's meant by saying that a font conforms to the Latin-3 character set and he'll look at you like you have two heads. 
     
  • George Thomas
    George Thomas Posts: 649
    edited February 2016
    Latin Plus claims no “language specific characters”:
    When I first extracted their list months ago I interpreted that to mean that it shares glyphs with other languages, with nothing specific to Hopi alone. In looking at the list just now that appears to be the case.

    As for Tokelauan, the list has all the needed glyphs except for the stacked accents. I'm assuming those are special-purpose for linguistics which is why they are omitted. Omniglot does not list them. Geonames.de doesn't list the language, for reasons not known to me.
  • Well, you won't hear any argument from me that client procurement requirements are often daft when it comes to character sets, but it's hard to blame them. Most companies don't have script and language experts, or even text processing experts who really understand what Unicode is....
    @John Hudson or anybody, for that matter.  If most companies don't have script and language experts - which I'm sure they don't - do they recognize that they have a problem in that regard and are they willing to pay for a solution?  I'm thinking consulting.
  • Frode, 
    Leaving out the Western "keyboard" characters - meaning ASCII and beyond - is not a part of my logic. I think it's a big mistake to leave them out of any font. 
    On this page:
    http://underware.nl/latin_plus/character_set/

    The basic Western chars are lumped together under the category: "Language non-specific" characters, whatever that means to the folks at Underware. 
  • @Frode Bo Helland There is only one list, at the download link. I don't see any in the list that have the basic character set.

    Their list really should include the accented glyph names for all languages in the list that use them. But then that could complicate things further because other sites such as Omniglot don't even mention them.
  • @Frode Bo Helland I don't have an explanation for the additional glyphs, but it wouldn't hurt to ask someone at Underware why they are there. Loanwords, maybe? I honestly don't know.
  • @Frode Bo Helland I counted 24 accented glyphs for Norwegian in the list, yet other sources indicate there should be only 18-20. That's why it appears there are too many to me.

    I agree with you there are likely errors or omissions in that list, and I have found errors or omissions in other sources too. Working on adding to my character set has made me wish for an ultimate authority, but I'm not sure if that is even possible.
  • Thanks Frode. That's the info I had on Norwegian so I'm good on that.
  • Ray Larabie
    Ray Larabie Posts: 1,438
    edited February 2016
    I think I might have derailed this thread but I think this topic is very important. I imagine there are a lot of new type designers who are curious about extending Latin language coverage.

    I just want to mention a few characters that I consider academic: IPA, Pinyin and Esperanto. The reason I categorize academic characters separately is so I can skip them in display fonts.

    If you've ever explored Latin Extended B, you'll have noticed that there are loads of historical IPA characters. If you're new to type design and you haven't bothered with combining accents because it looks like a bother, I've come up with a reduced set. If you eliminate the IPA-only combining accents, you're left with only 17 characters*.

    0300 grave
    0301 acute
    0302 circumflex
    0303 tilde
    0304 macron
    0306 breve
    0307 dot accent
    0308 dieresis
    030A ring
    030B double acute
    030C caron
    0313 comma (like gcommaaccent)
    0323 dot below
    0326 comma below
    0328 ogonek
    0337 slash (like Oslash)
    0338 slash (like oslash)

    I'm not sure if these are required if you're already including a Vietnamese set.

    0309 hook above
    031B horn

    * circumflex below is used for the Venda language but there are locations for those characters in the Latin Extended Additional Range.
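    One way to sanity-check a reduced list like this against your own character set is to let Unicode's canonical decompositions tell you which combining marks the precomposed letters are built from. A minimal sketch using only Python's standard `unicodedata` module; the function name and the sample string are my own. Note that it only sees marks that appear in canonical decompositions, so letters like oslash, which Unicode does not decompose, will not report a slash overlay.

```python
import unicodedata


def combining_marks_used(text):
    """Return the sorted code points of combining marks found in the NFD form of `text`."""
    marks = set()
    for ch in unicodedata.normalize("NFD", text):
        if unicodedata.combining(ch):  # non-zero combining class means a combining mark
            marks.add(ord(ch))
    return sorted(marks)


# A few precomposed letters: eacute, ntilde, aringacute, d with circumflex below.
sample = "\u00E9\u00F1\u01FB\u1E13"

for cp in combining_marks_used(sample):
    print(f"{cp:04X} {unicodedata.name(chr(cp))}")
```

    Any mark that turns up here but is not in your reduced list is one your character set actually depends on.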