
Encoding the unencoded

A client is requesting that I add the glyphs schwagrave and Schwagrave to a custom font I made for them. Neither of these glyphs has a dedicated Unicode codepoint. What would be best practice for providing an encoding?

An encoding is required; otherwise the unencoded glyph generates an error message when making an Accessible PDF.

I suppose I could use the Private Use Area of Unicode, but was wondering if there was a better alternative.

Comments

  • John Hudson
    These are encoded in Unicode as sequences of base plus combining mark. This means that there are two possible ways to implement this at the font level.

    1. Using dynamic mark positioning with anchors in the GPOS mark feature. This has the benefit of allowing arbitrary combinations of base glyphs and marks, e.g. schwa+dieresiscomb, schwa+brevecomb, etc., as well as the desired schwa+gravecomb (sketched at the end of this comment).

    2. Using precomposed diacritic glyphs mapped from the input sequence in the GSUB ccmp feature. This has the benefit of integrating more easily into kerning, e.g. keeping the mark at an appropriate distance from preceding letters such as T V W Y.

    For either solution, the font will need to support the combining mark characters (U+0300 for the combining grave), and you will likely want .cap variants of the combining marks for use above uppercase letters.
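    For concreteness, here is a minimal sketch of option 1 in OpenType feature syntax, with a ccmp rule that swaps in a cap-height mark variant after uppercase bases. The anchor coordinates, glyph classes, and the gravecomb.cap name are illustrative assumptions only, not values from any particular font.

    markClass [gravecomb acutecomb gravecomb.cap] <anchor 0 510> @MARKS_ABOVE;

    feature mark {
        # Attach marks in @MARKS_ABOVE to each base's above anchor.
        position base [schwa a e o u] <anchor 245 510> mark @MARKS_ABOVE;
        position base [Schwa A E O U] <anchor 295 700> mark @MARKS_ABOVE;
    } mark;

    feature ccmp {
        # Swap in the cap-height mark variant after uppercase bases.
        sub [Schwa A E O U] gravecomb' by gravecomb.cap;
    } ccmp;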
  • John Hudson
    For more explanation of combining marks, see my short talk from the Ezhishin conference: The Web of Text.
  •
    I have been hearing from African font users that "Unicode doesn't support X language" because they expect a legacy-style encoding and haven't come across the explanation John just gave of how Unicode does support all African Latin-script languages today.

    Unicode People Should Do Something 😆
  •
    By “legacy encoding” I assume @Dave Crossland means not a single-byte codepage, but a “flat simple text model” in which one logical character of the language = one codepoint. However, Unicode only does that for legacy encodings that pre-date Unicode. For later additions, if it uses something like diacritics, Unicode expects fonts and shaping engines to handle base-character-plus-combining-diacritic-character(s) as needed.

    Some folks with affected languages have taken the Unicode approach as an offense against their language and culture, and attempted workarounds that seem doomed to failure (e.g. https://en.wikipedia.org/wiki/Tamil_All_Character_Encoding)
  • John Hudson
    To be fair, Unicode People have been responding in the same way to the same questions for almost forty years. I think the questions won’t go away until implementers in the input and text editing areas get a better handle on interchange between the text encoding and the user experience.
  •
    If it matters whether a character is atomically encoded or encoded through a decomposed sequence, then something has gone wrong with the implementation.

    For example, I don't think it's true that Accessible PDFs require atomic encodings for characters if you put the source "words" into ActualText attributes. If your PDF creation library isn't doing that, that's where the problem is; not with Unicode.

    So I'd be interested in which situations these African font users are seeing where having an atomic encoding would make a difference - because those situations are software bugs.
  •
    Thank you for your informative presentation, @John Hudson.
    My client wants a precomposed glyph that they can select from the Glyph palette. The font will only be used by a tight universe of users, so I will proceed with using a PUA codepoint so that the glyph behaves when making an Accessible PDF document.
  • John Hudson
    If a precomposed glyph is mapped in the ccmp feature from the underlying base+mark sequence, that relationship should be captured via glyph palette input, so what ends up written to the document—and hence to the PDF—should be the correct Unicode. I say ‘should’, because Adobe’s glyph palettes are a bit finicky, and will sometimes insert raw glyph IDs into the text instead of underlying Unicode sequences.

    If the purpose of an Accessible PDF is to e.g. support screen readers for vision impaired users, then using PUA encoding might only technically overcome the PDF-creation hurdle: it won’t make the resulting PDF actually accessible, since a screen reader will have no way to know how to interpret the non-standard codepoint.
  • John Savard
    By “legacy encoding” I assume @Dave Crossland means not a single-byte codepage, but a “flat simple text model” in which one logical character of the language = one codepoint. However, Unicode only does that for legacy encodings that pre-date Unicode. For later additions, if it uses something like diacritics, Unicode expects fonts and shaping engines to handle base-character-plus-combining-diacritic-character(s) as needed.

    Some folks with affected languages have taken the Unicode approach as an offense against their language and culture, and attempted workarounds that seem doomed to failure (e.g. https://en.wikipedia.org/wiki/Tamil_All_Character_Encoding)

    Yes. That is exactly what they mean, and I think it's perfectly reasonable for people who speak any language whatever to expect to be able to use computers with it with just as much ease and convenience as, say, the speakers of French or German.
    However, while it is not unreasonable for these people to have these languages supported by a local character encoding designed for them on their own computers, that it may be unreasonable to expect such support to be included in Unicode is another matter. That isn't a matter of supporting one language, but of supporting all languages fully.
    For the computer to translate characters into overstrikes for communicating on the Internet, while still storing them in single bytes internally for local processing, so that computing in the local language isn't burdened by excessive overhead, ought not to be too much of a burden.

    Oh, by the way, your link is broken. But this one worked: https://en.wikipedia.org/wiki/Tamil_All_Character_Encoding
    which, I admit, doesn't seem to make sense.

  •
    Since I've never written a ccmp feature, I would welcome any guidance on how to do it. Looking around the usual sources, the explanations are a bit dense.

  •
    I just received some excellent advice from Karsten Luecke that explains it very clearly:

    feature ccmp {
    sub Schwa grave by Schwagrave;
    sub schwa grave by schwagrave;
    } ccmp;
  •
    The link had a trailing parenthesis which broke it. 

    Thomas is right. 

    Madness is doing the same thing over and over again and expecting a different result. Hopefully Unicode can evolve. 
  • John Hudson
    Hi James,
    feature ccmp {
    sub Schwa grave by Schwagrave;
    sub schwa grave by schwagrave;
    } ccmp;
    Note that the glyph name /grave/ is usually assigned to the legacy ASCII spacing character U+0060. For the ccmp feature, you want to make sure that it includes the combining mark /gravecomb/ U+0300, which is the correct Unicode input for the sequence. So
    feature ccmp {
    sub Schwa gravecomb by Schwagrave;
    sub schwa gravecomb by schwagrave;
    } ccmp;
    It may be tempting to double-encode the /grave/ glyph as U+0060 and U+0300, but combining marks are usually zero-width, so better handled as separate glyphs.
  • John Hudson
    Hopefully Unicode can evolve. 
    The core spec can’t change the way combining mark sequences are handled and can’t add any more characters with canonical decompositions. But Unicode’s CLDR project, which compiles locale-specific data such as sort order, date and time formatting, etc., has recently branched out into keyboard layouts, and is working on a standardised keyboard description data format which will enable user communities to define keyboards, submit them to CLDR, and have them picked up automatically and implemented in operating systems and other software.

    This year, CLDR also formed a new subcommittee for Digitally Disadvantaged Languages. Unfortunately, their meeting schedule regularly conflicts with other things for me, so I have not been able to be actively involved.
  • John Hudson
    FWIW, I think this is primarily a user experience problem, not a text encoding or font handling problem: we already know how to encode arbitrary diacritics based on base+mark sequences, and we already know how to display them in fonts (although as James’ query indicates, best practices for handling combining marks still need broader communication). Addressing input methods through intelligent keyboard design—including the kind of features currently available only in Keyman and absent from the simpler models employed in typical Windows and Mac system keyboards—is an important part of the UX solution, but ultimately I think there needs to be more standardisation around text editing, such as tracking input units so that if, for example, a user enters the character sequence ə̀ from a single keystroke, it should be deletable from a single keystroke also. Language users have expectations based on how they learn to understand the content of their writing systems as units, and the text editing experience should be able to respond to those expectations.
  • John Hudson
    On the subject of African-language diacritics, I saw this in the London Review of Books yesterday:


    This is Fred Smeijers’ Quadraat typeface, which the LRB has been using since soon after it was first released by FontFont in the 1990s, in PS Type 1 format. It’s nice to see that, as well as being updated to OT format, it has been extended with additional diacritics such as ẹ (U+1EB9) and ọ (U+1ECD), but note that the combining acute accent is misplaced after the latter in the sequence ọ́. While precomposed ccmp mappings can be convenient for known targets such as the ones James needs to support, dynamic GPOS positioning is far more flexible in being able to handle arbitrary and unanticipated diacritics such as this.
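    By way of illustration, the same dynamic approach extends to marks above and below the base with separate anchors, so that an arbitrary sequence such as o + dotbelowcomb + acutecomb comes out right. The glyph names and coordinates below are assumptions for the sketch, not values from Quadraat.

    markClass acutecomb <anchor 0 510> @MARKS_ABOVE;
    markClass dotbelowcomb <anchor 0 -30> @MARKS_BELOW;

    feature mark {
        # Separate attachment classes: above and below marks position
        # independently on the same base.
        position base [o e a] <anchor 245 510> mark @MARKS_ABOVE
                              <anchor 245 -30> mark @MARKS_BELOW;
    } mark;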
  •
    New precomposed characters with floating above/below diacritics (like acute accent above or dot accent below) won't be added to Unicode. But new precomposed characters with diacritics over the base letter are still possible.

    I suppose this is because these combinations are more difficult to handle using base+mark sequences, but it's only a guess. Maybe John Hudson or Denis Jacquerye could provide the exact information.

    One example of a diacritic over the base letter is the pair uniA7CB and uniA7CC, approved for inclusion in the next Unicode version and used in the Luiseño and Cupeño languages:



    Regarding the ccmp feature, FontLab 8 is able to automatically build the code based on combinations it finds in the font. You may need to expand the code in order to catch additional composites, but it's a very good start.
  •
    However, while it is not unreasonable for these people to have these languages supported by a local character encoding designed for them on their own computers, that it may be unreasonable to expect such support to be included in Unicode is another matter. That isn't a matter of supporting one language, but of supporting all languages fully.
    This is just completely wrong, and borderline FUD. African languages without atomic encodings (things like Ɛ́) are just as well supported in Unicode as legacy encodings like é, and I've just proved that by using them both in the same Unicode document.

    Again, if it matters that one has a multi-character representation and that other doesn't, something's gone wrong with your client software. File a bug there.
  • John Hudson
    But new precomposed characters with diacritics over the base letter are still possible.
    Yes, but critically they will not have canonical decompositions: they are encoded only as atomic characters, despite the existence of a number of combining mark overlay characters.

    I suppose this is because these combinations are more difficult to handle using base+mark sequences, but it's only a guess.
    They’re difficult to display using combining mark glyph sequences, so since they effectively require precomposed glyph handling there isn’t a net benefit to being able to encode them as base+mark character sequences. Such characters are considered on a case-by-case basis: while Unicode now has a preference for atomic encoding of diacritics involving overlays, I think if there were significant existing text encoding practice involving overlay mark sequences in a language, that might be grounds to avoid an atomic encoding. Note that the absence of canonical decomposition means that, when an atomic overlay diacritic is encoded, a visually similar decomposed sequence becomes a potential security risk, so the characters may be restricted in their use in e.g. domain names.
  •

    My client wants a precomposed glyph that they can select from the Glyph palette.
    If the characters they need are frequently used, the glyph palette or character picker is a very poor solution. They should look into using keyboard input:
    - A language-specific keyboard layout solution such as Keyman, which may already have an adequate layout, or Microsoft Keyboard Layout Creator for Windows or Ukelele for macOS.
    - Text replacement, with an alias that gets replaced by the characters you need. On macOS, for example, this can be done in System Settings > Keyboard > Text Input > Text Replacements, but it can also be done with a simple search and replace every once in a while.

    
    feature ccmp {
      sub Schwa gravecomb by Schwa_gravecomb;
      sub schwa gravecomb by schwa_gravecomb;
    } ccmp;
    
    I’d recommend using the names 'Schwa_gravecomb' and 'schwa_gravecomb' instead, since underscore-joined names can be mapped back to the underlying character sequence when text is reconstructed from glyph names. It’s 2023, but depending on the application generating the PDF, if it doesn’t add Unicode text data the glyphs will be gibberish for copy-paste or search in the PDF.

    I suppose this is because these combinations are more difficult to handle using base+mark sequences, but it's only a guess.
    It’s most likely for legacy reasons. Characters like đ weren’t decomposed to start with.
    It’s not recommended to use combining overlay characters for critical language material, as the position or shape may be inadequate or difficult, as you say. At the same time, existing characters can be prototypical and allow some variation, as different glyphs are used across different documents.
    However, while it is not unreasonable for these people to have these languages supported by a local character encoding designed for them on their own computers, that it may be unreasonable to expect such support to be included in Unicode is another matter. That isn't a matter of supporting one language, but of supporting all languages fully.
    This is just completely wrong, and borderline FUD. African languages without atomic encodings (things like Ɛ́) are just as well supported in Unicode as legacy encodings like é, and I've just proved that by using them both in the same Unicode document.


    Ironically, the font displaying Ɛ́ here doesn’t handle it and the combining acute is mispositioned. Users understandably confuse this, or the lack of input methods, with Unicode not supporting their language.


  •
    The character seems to be missing in the font used on TypeDrawers, so it depends on the fallback font your web browser uses. Looks okay in Safari. 


  •
    In Firefox:


  •
    I use Firefox on Windows myself and have reproduced the issue. It does seem to render correctly on Chromium-based Microsoft Edge, so this appears confined to Firefox’s renderer. (Renderer? Shaping engine?)
  •
    Shaping engine and/or fallback font’s functionality, perhaps?

    I say that because (1) the screen shots above seem to show the same base glyph and diacritic, working in one browser/platform but not the other, yet (2) on my system with Chrome on Mac, I get the same Noto font as the screenshots for regular text, but a completely different base glyph and diacritic glyph for that accented combo. Huh.

    Damn, if we are not 100% sure what is happening, what hope does the average user have?!   :'( 
  •

    Damn, if we are not 100% sure what is happening, what hope does the average user have?!   :'( 
    That is the very point of it. The average user sees a mess and just thinks ‘sh˙t why don’t I get the bloody character I need?’.
    The concept of sequence encoding is a nice one in theory. But it will only work reliably in practice when the combination and placement of two components works in a general, standardized and automated way. As long as every font / text engine / whatever works differently on the way to the final output, the end user is left in limbo. 99.99% of people gain nothing from a dozen specialists feeling good about holding up their beliefs and sticking to their principles.

    Moreover, on a cultural level the current situation also reveals a problematic inherited ‘colonialism’ aspect. Need French accented characters? Here they are. Need Spanish or Portuguese ones? Here they are. Need German, Turkish, Polish, Vietnamese? Here they are. “But these have been pre-Unicode legacy encodings, hence…” is merely a cheap excuse. Now in comes some African guy: “hey, where are the ones I need?” “Help yourself,” replies the (mainly English-speaking) tech community. (Anglophone natives don’t feel a need for accents at all; that has some merits of its own, but it doesn’t seem to help the rest of the world.)

    My idea to solve the problem would look like this:
    a) a keyboard which gives immediate and easy access to any diacritic of a given writing system,
    b) a keyboard which gives immediate and easy access to any base character of a given writing system,
    c) a background technique which produces any combination required for output, in a reliable way.
    As simple as this sounds, it still seems to be too complicated to achieve with the technical means we have … why is that so?



  •
    Agreed. And, to get this, we need dynamic keyboards — I mean, with keys that are mini-displays, changing after a given trigger. I once thought that dynamic keyboards were near for MacBooks, but Apple stopped being innovative before that became reality. Maybe someday…


    I use a custom keyboard map that implements a bit of Andreas’s idea:


  • John Savard
    Again, if it matters that one has a multi-character representation and that other doesn't, something's gone wrong with your client software. File a bug there.
    That's all well and good for the situation it describes.
    However, if the situation is:
    I want to write software that supports my local language;
    and I want it to be just as easy and simple for me to write that software as it was for other people to write similar software with the same functionality that ran on machines like, say, the IBM 1401, which supported English...
    ...then the fact that multi-character representations are present does create an issue, and not because some pre-existing software is inadequate.
    Having only a non-legacy representation for one's language is to be at a disadvantage compared to people not facing that issue. To me, this is an obvious fact, and stating it is not creating FUD.
    It may not be reasonable to expect the Unicode Consortium to solve this problem for everyone (by changing its rules, and then doing a huge amount of extra work), but that doesn't make the problem insoluble.
    You just write your own operating system for the use of people who speak your language; that operating system uses a character code which does include a legacy encoding for your language, which is then used for all routine internal data processing purposes on the computers involved... and translation into Unicode, where multi-character representations are required, is done for such exotic tasks as accessing the Internet.
    Since "write your own operating system" can simply mean port Linux (or BSD!)... it's not impossible, although porting either to a character set other than ASCII is something I'm not aware of anyone having tried. (Although if your "new character set" can look like a member of the ISO 8859 family, that is, a set of 256 characters of which the first 128 are those of ASCII, then one wouldn't necessarily have to resort to a rather difficult bootstrap process to get started.)

    Of course, one may be even more unreasonable, and instead of being willing to use a special dialect of Linux, one may want it to be easy to write software supporting one's language in Microsoft Windows. Or, to be even less reasonable, one may want every piece of software ever written that supports the English language to support, without modification, one's own language.
    Oh, come on! Surely that's impossible! Well, if one uses Q, X, and Z as escape characters, and puts a translation layer into dialog boxes, text boxes, and printing... possibly something could be worked out.
    In any case, from the way I view the world, it seems to me that the people who think that not supporting legacy-style encodings for the new languages on the block (where they are not, in fact, legacies from the past) is just fine... are just asking to open their newspapers one fine morning to read something like the following... Today, Burma, India, Thailand, Laos and 59 other countries have agreed to adopt the new international standard BRICSCODE instead of Unicode to better reflect the post-colonial reality of a multipolar world.
    And the decisions of the BRICSCODE consortium are made in Beijing, Moscow, Teheran, and maybe even Pyongyang.
    So, in my view, the Unicode Consortium is turning a blind eye to the political realities that may determine if Unicode is even permitted to survive as the dominant standard international character encoding. Some of the little countries of the world are... touchy... about having full equality with all other sovereign states, and in some cases for understandable reasons given their history.
  •
    I propose a technical word count limit for postings.
  •
    @John Hudson Note that the glyph name /grave/ is usually assigned to the legacy ASCII spacing character U+0060. For the ccmp feature, you want to make sure that it includes the combining mark /gravecomb/ U+0300, which is the correct Unicode input for the sequence. So
    feature ccmp {
    sub Schwa gravecomb by Schwagrave;
    sub schwa gravecomb by schwagrave;
    } ccmp;
    But why? I understand that this is the preferred approach, but I see this as a barrier for my client to input the proper glyph sequence. So much easier for them to input /schwa and /grave than to hunt around for the proper combining grave glyph. The ccmp feature works with either grave as I test it.

    Or am I missing something here?
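    For reference, a single set of rules can accept either grave on the input side by using a class. This is only a sketch, reusing the glyph names from the examples above; as noted earlier in the thread, the gravecomb sequence is what should end up in the underlying text.

    feature ccmp {
    sub Schwa [gravecomb grave] by Schwagrave;
    sub schwa [gravecomb grave] by schwagrave;
    } ccmp;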
  •
    James Montalbano said:
    But why? I understand that this is the preferred approach, but I see this as a barrier for my client to input the proper glyph sequence. So much easier for them to input /schwa and /grave than to hunt around for the proper combining grave glyph. The ccmp feature works with either grave as I test it.

    Which keyboard layout has a schwa key but no combining grave?
