The Terms "Glyph List" versus "Character Set"

Richard Fink · January 2016

I have been working on HTML test pages that use the SIL 'Last Resort' font as a fallback: when the specified character doesn't exist in the first font in the CSS font stack, the browser falls back to the next font in the stack, and the character's Unicode hex value is displayed. It's a nice technique for sizing up the character set of any web font.
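A minimal sketch of how a page like that could be generated, in Python. The family names `FontUnderTest` and `FallbackHexFont` are placeholders rather than real fonts, and `build_test_page` is a made-up helper, not anything from the actual test pages. Each character is written as a decimal numeric character reference, so the page source stays pure ASCII.

```python
def build_test_page(codepoints, test_font="FontUnderTest",
                    fallback_font="FallbackHexFont"):
    """Return an HTML page that shows each code point in a two-font stack.

    Both font names are placeholders. The fallback is assumed to be a font
    that draws the hex value of any code point it is asked to render.
    """
    # One <span> per code point, written as a decimal character reference.
    cells = "\n".join(
        f'<span title="U+{cp:04X}">&#{cp};</span>' for cp in codepoints
    )
    return (
        "<!DOCTYPE html>\n"
        '<html><head><meta charset="utf-8">\n'
        f"<style>body {{ font-family: '{test_font}', '{fallback_font}'; }}</style>\n"
        "</head><body>\n"
        f"{cells}\n"
        "</body></html>\n"
    )


# Example: a four-character slice of Latin-1 punctuation and currency signs.
page = build_test_page(range(0x00A1, 0x00A5))
```

Any character missing from the first font would then be drawn by the fallback, which is the whole trick.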

And this has led me to a re-evaluation of the currently most-used character sets as to their relative fitness for crafting web fonts, in general.

I was wondering if the denizens of this forum, like me, consider the term "glyph list" to be synonymous with "character set".

Slightly different flavor but as a practical matter, the same?

Eager to know what the consensus is.

Thanks.

Nick Shinn · January 2016

Please observe the correct distinction between characters and glyphs!

In all my fonts, the glyph list is longer than the character set, including such things as small caps and alternate figures.

Characters: The smallest semantic units of a language.

Glyphs: The specific form characters can take in a font.

Kent Lew · January 2016

Although in marketing parlance, the two are often used interchangeably, from a production standpoint I consider there to be an important distinction.

Glyph List is the totality of the actual glyphs in the font.

Character Set is the set of Unicode codepoints that are supported — i.e., you can “type” it and it will appear in one form or another.

As you know, depending upon language localization, as well as typographic and stylistic variants, a given character may be represented by multiple glyphs.

A Glyph List may end up being equivalent to the Character Set, but more often than not the glyph list is a superset of the character set.

One could say that a Glyph List is the product of a Character Set crossed with an OT Feature Set.

IMHO.
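To make the distinction concrete, here is a small Python sketch with toy data. The glyph names and the two-entry character map are invented for illustration and are not taken from any real font. It treats the character set as the keys of a Unicode-to-glyph map and the glyph list as everything the font contains.

```python
# Toy font data, made up for illustration only.
glyph_list = [".notdef", "a", "a.sc", "one", "one.tf", "one.osf", "f_i"]
cmap = {0x0061: "a", 0x0031: "one"}  # Unicode code point -> default glyph

# Character set: the code points you can "type".
character_set = set(cmap)

# Glyphs reachable only through features (small caps, figure styles, ligatures).
encoded_glyphs = set(cmap.values())
unencoded_glyphs = [g for g in glyph_list if g not in encoded_glyphs]

print(sorted(f"U+{cp:04X}" for cp in character_set))
print(unencoded_glyphs)
```

In this toy case the glyph list has seven entries and the character set has two, which is the "superset" situation described above.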

John Hudson · January 2016

What Nick and Kent said. Since not all glyphs are encoded, it stands to reason that a glyph list and a character set are different.

Hin-Tak Leung · January 2016

I'll give an example where glyph list is smaller than character list. In Chinese, dictionaries are indexed by 'radicals', which are distinct subparts of a character shape. A glyph used as a radical and the same glyph used as a standalone character occupy different positions in the encoding chart.

Glyphs are a collection of visual shapes; characters are a collection of units of typographical meaning in a language.

Richard Fink · January 2016

OK. Got it. Completely.

I like Kent's wording the best:

Glyph List is the totality of the actual glyphs in the font.

Character Set is the set of Unicode codepoints that are supported — i.e., you can “type” it and it will appear in one form or another.

Thanks!

Kent Lew · January 2016

I'll give an example where glyph list is smaller than character list.

True. There are glyphs in Latin (and others) as well that can represent multiple characters. In a development process, the glyph set can in some circumstances be smaller than the character set.

Current best practice discourages multiple-encoding in the compiled font, however. So such glyphs may be duplicated for final encoding.

In which case, it’s theoretically possible for the source glyph list to be smaller than the character set, and the compiled glyph list to be larger than the character set.

But Richard shouldn’t need to bother with that fine a distinction. ;-)

Nick Shinn · January 2016

What’s the difference between multiple encoding and referencing a component?

For instance, using the same bezier path for /endash and /minus.

Is that one glyph repeated, or two glyphs?

Is /i, composed of /dotlessi and dotaccent, one glyph or two?

And am I OK to use /slash to indicate both characters and glyphs, indiscriminately?

Hin-Tak Leung · January 2016

I resent the "character list" = "unicorn code points" idea. Not everything is unicode. In a rather large part of the world, computers sold must by law support the national encoding, not unicode.

I also think the unicode committee got it wrong with the bulk of simplified Chinese vs traditional Chinese - they are glyph variations, not character variations.

Mark Simonson · January 2016

"unicorn code points"

Hin-Tak Leung · January 2016

Argh, my phone's autocompletion

John Hudson · January 2016

I resent the "character list" = "unicorn code points" idea. Not everything is unicode. In a rather large part of the world, computers sold must by law support the national encoding, not unicode.

True, but software often does that by mapping from Unicode to those national encodings, and Unicode cmap tables tend to be pretty standard in fonts as a result. It's a while since I encountered a font that actually had a non-Unicode character map.

Nick Shinn · January 2016

http://emojipedia.org/unicorn-face/

Richard Fink · January 2016

Wassup with those unicorns! Pesky bastards. And horny as hell, I hear.

OK so who do we lynch for the term Windows Glyph List? Because I think that's subliminally what prompted the question that started this thread.

Why was a character set named a glyph list? To confuse everybody, of course.

@Hin-Tak Leung:

OK. I get your point. (No pun intended.) Unicode is not quite universal. But +1 to what John Hudson said.

I do have an example to bolster your argument though:

Consider a Unicode reference in a web page for the points between 128 and 159 (decimal), a range I call "the minefield": it's a kind of demilitarized no-man's land, because before Unicode there were two competing mappings for those points, depending on whether ANSI (Windows) or MacRoman (Mac) was in effect.

Now, if I specify Unicode point 128 (&#128) in a web page, and the web font loaded in the page has, say, an exclamation point drawn at point 128, Windows will automatically check whether the font has Unicode point 8364 (the Unicode Euro) defined, correlate that point to ANSI 128, and display the Euro at point 8364 instead of the exclamation point I specifically put in the font at 128.

In other words, it sort of goes into "translate to ANSI" mode, and Windows automatically dismisses my wish that it display the exclamation point.

What the Mac does, I haven't checked. But I should, just out of curiosity.

Legacy stuff.

@nick shinn:

Thanks for those additional examples of glyph versus character. Helps explain it to the uninitiated. (Oh yeah, I forgot to mention I've been initiated. I'm a made man in the font mafia, now. Cool, huh?)

TTYL - rf

Kent Lew · January 2016

Is there a <meta charset= /> declared for your HTML or not?

The exclamation point is U+0021, which translates from hex to dec as &#33.

&#128 translates to U+0080, which is a control char. What are you expecting to happen when you put in the &#128 entity?

SiDaniels · January 2016

"Current best practice discourages multiple-encoding in the compiled font"

Wasn't this to help in indexing or searching ancient "print to PDF" PDFs? Is this still best practice? Didn't Adobe fix this?

Cheers, Si

John Hudson · January 2016

Wasn't this to help in indexing or searching ancient "print to PDF" PDFs? Is this still best practice? Didn't Adobe fix this?

It's not entirely fixable, because PDFs can still be made by distilling from a print stream so this situation can still occur, even though other methods of PDF creation — which write character strings to the PDF — are encouraged.

That said, I've never characterised the single-path-from-GID-to-Unicode rule as 'best practice' for fonts, even when print stream distilled PDFs were a lot more prevalent. Rather, it is a particular production method for a particular technical requirement. When making fonts for clients, I always explain the issue to them, and ask whether it is important for them to enable accurate Acrobat text reconstruction from glyph names. Some say yes, most say no.

One hopes they say no when it's a polytonic Greek font, because the number of redundant duplicate glyphs for all-caps and smallcaps suppressed accent variants required is just silly.

Chris Lozos · January 2016

...multiple-encoding in the compiled font

This sure would help with Omega, etc math and Greek.

Hin-Tak Leung · January 2016

@SiDaniels : @John Hudson is right - the situation is not entirely fixable: when multiple encoding points correspond to the same glyph, the reverse - from glyph (in a pdf, the 'visual' representation) back to code point (i.e. "text", a meaningful piece of textual info) - is ambiguous. You can probably use some contextual clues to distinguish the different alternatives of reverse mappings, but that's rather specific to language etc and difficult to implement.

So "pdf to text" , for a general non-English text, under some circumstances is difficult to do.
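The ambiguity can be sketched in a few lines of Python. The character map below is a toy example, not read from a real font: it double-encodes one glyph for the Greek capital Omega and the Ohm sign, then inverts the map to show that a glyph-to-code-point lookup no longer has a single answer.

```python
from collections import defaultdict

# Toy cmap with double encoding: two code points share one glyph.
cmap = {
    0x03A9: "Omega",  # GREEK CAPITAL LETTER OMEGA
    0x2126: "Omega",  # OHM SIGN
    0x0041: "A",      # LATIN CAPITAL LETTER A
}

# Invert the map: glyph name -> every code point that points at it.
reverse = defaultdict(list)
for codepoint, glyph in cmap.items():
    reverse[glyph].append(codepoint)

# Glyphs whose reverse lookup is not unique.
ambiguous = {g: cps for g, cps in reverse.items() if len(cps) > 1}

for glyph, cps in ambiguous.items():
    print(glyph, "could be", ", ".join(f"U+{cp:04X}" for cp in cps))
```

A tool that only sees the glyph "Omega" has to guess between the two code points, which is where the contextual clues would have to come in.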

SiDaniels · January 2016

Thanks re the double mapping.

Richard, WGL4 is a "character set"... it says so here.

As to why WGL and not WCL? Probably lost to the sans of Times. ;-) But WGL rolls off the tongue better than WCL, don't you think?

Georg Seifert · January 2016

There are several other cases where double encoding would make sense (Arabic isolated forms...). I asked Read Roberts several times if he could fix this in makeOTF. And Adobe uses double mapping in their CID fonts, so it can’t be so bad.

Kent Lew · January 2016

Sorry if “best practice” was an overstated characterization.

For a custom font, there is the advantage of sharing the decision-making with the client in full knowledge (theoretically) of the destined environments and workflows.

But for a general retail font, I was under the impression that this was still the recommended approach.

Georg Seifert · January 2016

It is recommended for the Greek/Math glyphs and such but that shouldn't mean it is not useful for other things.

Richard Fink · January 2016

@si I had a feeling it was something like that, Si. Yes, it's more mellifluous as WGL. LOL.

As far as the Uni to ANSI switcheroo, I found out about it trying to be slick - I wanted to use a character in that range - since it's largely going unused - to brand the font for conversion into a particular format - TTF, OTF, WOFF1, WOFF2, EOT. So I could more easily find out which file listed in the web page's CSS is being loaded. But Windows reached out and protected me from my stupidly clever self.

@kent Oh, sorry, is not 128 the Euro in ANSI? Is it 129? (Anyway, whatever in that range in ANSI is the Euro is the codepoint.)

Further, all my test pages are declared UTF-8. Also, I actually write the text using html Unicode chars like  . So even if you were to choose another encoding in the browser's drop down list, nothing should change. (Or at least it doesn't for me.) I do this because when you just type HTML using an ordinary text editor - I'm really not sure what exactly causes it - you can mangle the text by selecting another encoding in the browser's drop down list.

For example: there is a suite of test pages - requiring a web server - created by Pablo Impallari which has pages for a variety of foreign languages. But unfortunately the pages don't use HTML Unicode - either decimal (fewer bytes, smaller page) or hex. And so, I can mangle the text by choosing another encoding in the browser's encoding menu. When I told Pablo about it he said that I should just not do that! While a terrific guy and type designer, Pablo has obviously never done any desktop support. Hah! What CAN go wrong WILL go wrong, as sure as the sun will come up tomorrow. Plus, I'm too much the scientist to settle. If I find the time, maybe I'll do a fork of Pablo's site - which I do like and is a source of inspiration for me, but unfortunately, today, it's the only site with halfway systematic quality control pages for web fonts in certain languages like Tamil, Devanagari, and others.

http://www.impallari.com/testing/

That's what I've been about the past few months: trying to fill the gap in tools and knowledge with pages that don't require a web server, are mangle-proof, and test multi-lingual fonts in a more thorough way. Past few weeks, I've focused on pages that test compliance with the major character sets. I've got all the Adobe sets done.

I've got some of those pages posted in my repository on Github, but I'm struggling to figure out a versioning scheme that makes sense as I improve and expand the suite of pages. I'll be updating this weekend and hopefully I'll figure out some kind of numbering and directory structure that makes sense. But certainly feel free to poke around what's there if you want:

All suggestions welcome. I'll post about updates here and on the GWF forum as they appear. I'm convinced I'm on the right track, but it took me a little time to get the methodology - the steps in the production line - smoothed out. But it's fairly flowing now.

https://github.com/richardfink/webfont-testpages/

I'm going to rip off a test page today for a web font's compliance with the Windows Glyph List which, as we now know, isn't a glyph list but a character set.

@Georg Seifert, Chris, and all - Thanks for the observations. I haven't gotten into any Arabic test pages yet, or Math, but if the fonts are out there as web fonts, test pages will be coming soon.

Kent Lew · January 2016

Um, I might be wrong, but if you declare the charset to be "utf-8" and then you enter a decimal entity, then I think that entity codepoint is going to be interpreted according to the UTF-8 charset.

So, it doesn’t matter what the decimal codepoint is in ANSI. If you input that decimal entity, it will be interpreted as the decimal equivalent of the corresponding Unicode hex codepoint.

If you want to input a Euro with an HTML entity, then you have to either use the hex &#x20AC or the equivalent dec &#8364.

Or if you want the dec entity &#128 to yield a Euro, then you have to declare the encoding for the HTML as ANSI, not UTF-8.

And if the HTML does not have a declared charset, then I assume that any decimal entities will be interpreted according to the browser default. In which case, all bets are off.

Right? Or is there something I’m missing. Web encoding is not my primary gig.

Hin-Tak Leung · January 2016

@Richard Fink I am not exactly sure about a lot of what you are saying.

For example, modern web pages have their own embedded encoding declaration near the top - fiddling with the browser's encoding setting should not have any effect. If it has, it is a sign of poor web authoring. Also your browser can automatically negotiate with the remote Web site what language encoded versions to fetch, if multiple versions exist on the remote server.

As for "ordinary text editor", some might say Microsoft WordPad is one such (it is not - it is only 'ordinary' in terms of being 'very common'), some might say GNU Emacs is an "ordinary text editor", which again it is not - until a version or two ago it used its own incompatible-with-everything-else encoding called MULE (multi-lingual extension) for everything non-ASCII, because it came from the 1980s, and the switch to Unicode internals is rather recent.

Also note that 'css' strictly speaking are 'recommendations' from the Web authors, which the browser can be configured to ignore partly or wholly.

So there is a lot I don't understand about what you are actually doing/using, and what you are trying to achieve.

Richard Fink · January 2016

Kent, Hin-Tak, I'll be back with screenshots to show you in a day or so, maybe sooner.

Denis Moyogo Jacquerye · February 2016

HTML entities are independent of the encoding defined, they always map to what they were defined to map to in the standard (Unicode for the numerical values and subsets for the named ones).

Many browsers render &#128; as U+20AC because too many web designers assume that’s what they’ll get. In fact the range 128 to 159 may be mapped to whatever they are in CP1252 (a superset of Latin 1) by some browsers but not by all (because they're supposed to be undefined).

Don’t use &#128; but instead use &#x20AC; or &#8364;, or if you absolutely want to, &euro; or a literal €.
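One way to see this remapping outside a browser is Python's standard `html.unescape`, which decodes character references using the HTML5 rules. This is only a sketch of what one parser does; it says nothing about how a particular browser or operating system text stack behaves.

```python
import html

# Four ways of writing a Euro, the first of which relies on the
# HTML5 remapping of reference 128 through the Windows-1252 table.
references = ("&#128;", "&#x20AC;", "&#8364;", "&euro;")

for ref in references:
    decoded = html.unescape(ref)
    print(ref, "->", f"U+{ord(decoded):04X}")
```

Under those rules all four references decode to the same character, so a glyph placed at code point 128 in a font would never be asked for by a parser that follows them.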

Richard Fink · February 2016

Denis Moyogo Jacquerye said:

HTML entities are independent of the encoding defined, they always map to what they were defined to map to in the standard (Unicode for the numerical values and subsets for the named ones).

Yes, that explains it pretty well. But still, what does "independent of the encoding defined" mean?

Here's Pablo Impallari's test page for Kannada in Firefox, first with "Unicode" selected in the encoding menu and then with "Western" selected. (Easier to see if you zoom in a little.) As you can see, the Kannada chars are fine with Unicode selected. But I can mangle it completely by selecting another encoding. In this case, Firefox's "Western" encoding. (What the "Western" encoding is within Firefox, I have no idea. No ISO number is listed for it.)

So, what Denis and I are saying is that if you designate the text using an HTML unicode point OR, as Denis points out, one of the HTML "entities" like &amp; then you override the encoding chosen because the underlying HTML is saying "only these Unicode points, please". It's thereby "independent" of the encoding chosen.
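A rough Python illustration of that independence, under one assumption: that the "Western" menu entry behaves like Windows-1252 (`cp1252` in Python). The sample word is "Kannada" written in Kannada script, given as escapes so the snippet itself is ASCII.

```python
import html

# "Kannada" in Kannada script: U+0C95 U+0CA8 U+0CCD U+0CA8 U+0CA1.
text = "\u0c95\u0ca8\u0ccd\u0ca8\u0ca1"

# Case 1: raw UTF-8 bytes read back under the wrong encoding.
# errors="replace" is needed because some bytes are undefined in cp1252.
raw_utf8 = text.encode("utf-8")
mangled = raw_utf8.decode("cp1252", errors="replace")

# Case 2: the same text as decimal numeric character references.
# The bytes are pure ASCII, so the wrong encoding cannot disturb them.
as_refs = "".join(f"&#{ord(ch)};" for ch in text)
survived = html.unescape(as_refs.encode("ascii").decode("cp1252"))

print(mangled == text)   # False: the raw bytes were misread
print(survived == text)  # True: the references still name the same code points
```

The cost is page size: each Kannada character takes three bytes in UTF-8 but seven as a decimal reference.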
