PUA Overcrowding

ClintGoss · May 2019

I am wondering why, with the vast tracts of codepoint space available up in the wilds of Planes 15 and 16, everyone (MUFI, SMuFL, ConScript, SIL, etc.) wants to crowd down into the tiny PUA island carved out in the BMP.

I (generally) get the historical argument, but PUA-A and -B have been around since Unicode v2.0 (23+ years?)

Is there a reason / rationale here? Is there a general carving-up of the PUA that I'm missing?

What causes well-tempered typographers to avoid saddling up the font tools and venturing to the wide open spaces of the higher Plains?

André G. Isaak · May 2019

I don’t know anything about ConScript or SMuFL, but SIL and MUFI have both been around for some time. I would assume that the historical argument was far more relevant when they first began, and relocating would cause significant hassles once they'd already carved out their own, often conflicting, space in the BMP PUA.

IIRC, MUFI’s PUA usage was to some extent shaped by the PUA in TITUS Cyberbit, and that’s been around since back when the Flintstones still peacefully coexisted with the dinosaurs.

Bhikkhu Pesala · May 2019

I am moving away from using PUA encodings as far as possible. It is better to use OpenType features to access alternate glyphs.

See this earlier discussion.

AbrahamLee · May 2019

For those unaware, SMuFL stands for Standard Music Font Layout. It is an attempt to do for the music notation software community what Unicode did for the multilingual typesetting community, while still complying with Unicode, since most computer software expects that. However, since Unicode only defines a relatively small glyph set for music, SMuFL exists almost entirely in the PUA, containing several thousand unique music-related glyphs that wouldn’t otherwise require an OpenType feature to access them.

Mark Simonson · May 2019

No one can stake a claim in the PUA. Anyone using it (or "carving it up"), even for non-Unicode standards, must do so with the understanding that it can be used by other fonts for other things.

Jacob Casal · May 2019

Is there a technical difference between the PUA and PUA-A and -B? I suppose other than when one changes fonts the supplementary PUA glyphs wouldn’t change to said other font’s PUA encodings?

Thomas Phinney · May 2019

In normal apps, the encodings stay the same, and you get whatever glyphs the font has at those codepoints.

Peter Baker · May 2019

André is exactly right about MUFI, which has been coordinated with TITUS from the beginning. The current version of the TITUS Cyberbit font is downloaded from a page dated 2009, with a header "Compliant with UNICODE 4.0," i.e. 2003-5. But TITUS is older. MUFI has been around since 2001--a time when I suspect application support for the upper range was poor.

Between MUFI and TITUS, the PUA for medievalists/classicists/linguists is getting very crowded. It looks unlikely that MUFI will add much to its recommendation in the near future, but if there is ever a push to expand it, I'm sure it will push into the upper range.

ClintGoss · May 2019

André G. Isaak said:

... relocating would cause significant hassles ...

... except that maybe OT provides a straightforward (and pretty slick) upgrade path into the upper planes ... what if each PUA user:

Picks a second plot of code-point territory in the upper planes (the "upper PUA") in addition to your PUA range in the BMP (the "lower PUA") and multi-map your PUA glyphs onto both your Upper and Lower PUA ranges.

The Lower PUA still works and apps can migrate to using the Upper PUA over time. You can even incentivize the use of the Upper PUA by adding new code points in your range only in the Upper PUA.
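A minimal sketch of the double-mapping idea, assuming a font's character map can be modeled as a plain dictionary from code point to glyph name. The names `UPPER_BASE`, `upper_pua`, and `double_encode`, the choice of U+100000 as the start of the "upper" plot, and the glyph names are all hypothetical, chosen only for illustration:

```python
from typing import Dict

BMP_PUA_START = 0xE000   # first code point of the BMP Private Use Area
BMP_PUA_END = 0xF8FF     # last code point of the BMP Private Use Area
UPPER_BASE = 0x100000    # assumption: the "upper PUA" plot starts at the bottom of Plane 16


def upper_pua(cp: int) -> int:
    """Return the mirrored upper-PUA code point for a BMP PUA code point."""
    if not BMP_PUA_START <= cp <= BMP_PUA_END:
        raise ValueError(f"U+{cp:04X} is not in the BMP PUA")
    return UPPER_BASE + (cp - BMP_PUA_START)


def double_encode(cmap: Dict[int, str]) -> Dict[int, str]:
    """Copy cmap, adding an upper-PUA entry for every BMP PUA entry."""
    out = dict(cmap)
    for cp, glyph in cmap.items():
        if BMP_PUA_START <= cp <= BMP_PUA_END:
            # setdefault keeps any mapping that already exists up there
            out.setdefault(upper_pua(cp), glyph)
    return out


# Hypothetical character map: one ordinary letter plus two PUA glyphs.
lower = {0x0041: "A", 0xE000: "glyphOne", 0xE001: "glyphTwo"}
both = double_encode(lower)
for cp in sorted(both):
    print(f"U+{cp:04X} -> {both[cp]}")
```

Both the lower and the mirrored upper code points end up pointing at the same glyph, so existing documents keep working while new ones could use the upper range.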

Is this workable?? I'm likely missing some significant issues here ... but ...

The way it stands, the lower PUA is essentially the dreaded code-page scenario ... swapping contexts (fonts) to get the right set of characters that are overloading the code points.

Thomas Phinney · May 2019

“the lower PUA is essentially the dreaded code-page scenario”: Yes, quite. Except for specialized bits of stuff, where any single user is unlikely to have a major conflict at any time.

“apps can migrate”: Urm, no. General-purpose apps like Word have no knowledge or understanding of these PUA assignments, nor should they. Nor even could they for the most part, since it varies by font. (Unless they hard-code codepoint meaning on a per-font basis, and that is so not going to ever happen!)

What would need to migrate is the encoding of existing documents, in a large variety of apps. That is hard to do.

As long as the “lower PUA” still works, even any specialty app that really understands this stuff has little incentive to change how it treats new docs, either. Say that the specialty app is (for example) a music composition app. Why should it even care that Klingon overlaps with its PUA usage? If the fonts double-encode stuff, what reason would that app have to change its usage of those code points? Not saying you couldn’t convince some apps, perhaps, out of “principle,” but the functional rationale for them is... slim.

ClintGoss · May 2019

Ah, OK, Thomas ... I get it now. There's unlikely to be concurrent contention over the PUA (unless Klingons take a shine to FontAwesome icons ... and then only if they had not figured out how to switch fonts).

However, as Jacob Casal asked ...

Is there a technical difference between the PUA and PUA-A and -B?

e.g. Are there significant current apps that can handle the PUA, but not PUA-A and -B? ... or other dastardly daunting scenarios pushing us back down into the BMP?

Khaled Hosny · May 2019

Before Emoji, support for characters outside the BMP was erratic; many applications and programming languages that implemented Unicode support during the UCS-2 era didn’t handle UTF-16 surrogate pairs correctly. So I can understand why such PUA-using initiatives would have wanted to stay in the BMP.
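To make the surrogate-pair issue concrete, here is a small sketch of the standard UTF-16 arithmetic. The function name `to_surrogates` and the sample character (U+1F600, an emoji outside the BMP) are illustrative choices, not something from this thread:

```python
from typing import Tuple


def to_surrogates(cp: int) -> Tuple[int, int]:
    """Split a supplementary-plane code point into its UTF-16 surrogate pair."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("only code points above the BMP need surrogates")
    v = cp - 0x10000                 # 20 bits remain after removing the BMP offset
    high = 0xD800 + (v >> 10)        # top 10 bits go into the high surrogate
    low = 0xDC00 + (v & 0x3FF)       # bottom 10 bits go into the low surrogate
    return high, low


sample = 0x1F600                     # illustrative code point outside the BMP
high, low = to_surrogates(sample)
print(f"U+{sample:X} -> {high:04X} {low:04X}")

# Software that assumes "one 16-bit unit = one character" counts two here.
units = len(chr(sample).encode("utf-16-le")) // 2
print("UTF-16 code units:", units)
```

Code written with UCS-2 assumptions sees two separate 16-bit units and may split, truncate, or miscount them, which is the kind of erratic behavior described above.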

Sometimes I think putting Emoji outside the BMP was a plot by Unicode to get as many applications as possible to support higher Unicode planes:

Before Emoji:

Users: your application does not correctly handle these mathematical symbols or rarely used CJK characters

Developers: ¯\_(ツ)_/¯

After Emoji:

Users: your application does not support Emoji

Developers: OMG! We need to fix this ASAP or we are going out of business!

Thomas Phinney · May 2019

ClintGoss said:

However, as Jacob Casal asked ...

Is there a technical difference between the PUA and PUA-A and -B?

e.g. Are there significant current apps that can handle the PUA, but not PUA-A and -B? ... or other dastardly daunting scenarios pushing us back down into the BMP?

Yeah, I missed answering this properly.

So, I agree 100% with Khaled; he has it right. For those who don’t know all the lingo, I am going to restate what he said:

The “BMP” is the Basic Multilingual Plane, the antique section of Unicode in which every character can be represented as a single double-byte (16-bit) code unit. It is all in the first 64K code points of Unicode. There are additional “planes” (64K sections), and apps have to be just a tiny bit smarter to deal with them. Not much smarter, but just a little.

Most of the stuff people actually use day-to-day is in the BMP, including the original PUA (“Private Use Area”).
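As a rough illustration of planes and the three private-use ranges, the sketch below classifies a code point. The range boundaries are the ones defined by Unicode; the helper names `plane` and `pua_name` and the labels are just illustrative:

```python
from typing import Optional

# (label, first code point, last code point) for the three private-use ranges
PUA_RANGES = [
    ("PUA (BMP)", 0xE000, 0xF8FF),
    ("PUA-A (Plane 15)", 0xF0000, 0xFFFFD),
    ("PUA-B (Plane 16)", 0x100000, 0x10FFFD),
]


def plane(cp: int) -> int:
    """Each plane is a 64K block, so the plane number is the code point shifted right 16 bits."""
    return cp >> 16


def pua_name(cp: int) -> Optional[str]:
    """Return the label of the private-use range containing cp, or None."""
    for name, first, last in PUA_RANGES:
        if first <= cp <= last:
            return name
    return None


for name, first, last in PUA_RANGES:
    print(f"{name}: U+{first:04X}..U+{last:04X}, {last - first + 1} code points")

for cp in (0xE000, 0xF0000, 0x10FF00, 0x1F600):
    print(f"U+{cp:04X}: plane {plane(cp)}, {pua_name(cp)}")
```

The size difference is the point of the original question: the BMP range holds 6,400 code points, while each of the two supplementary ranges holds 65,534.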

BUT, there is an increasingly large quantity of emoji going outside the BMP. The reason is simple: almost everything new being added to Unicode goes outside the BMP, because the BMP is quite full. It is just that most of the new stuff being added is pretty obscure.

Emoji is the exception to that. They are outside the BMP! So suddenly apps have started caring about extending their Unicode support beyond the BMP.

This helps emoji work, but also helps enable a ton of other things! There are all sorts of things beyond the BMP: relatively new writing systems (Adlam, from west Africa); super obscure (Warang Citi, Mro, Duployan, Minoan Linear A, Phaistos Disc); extensions of rare, obsolete or historic characters for better known writing systems that are mostly in the BMP (such as for Arabic, Sinhala, Mongolian); or in a few cases, languages we have all heard of but hardly anybody actually uses (Egyptian Hieroglyphs, Cuneiform). So, anything that happens to be a real latecomer to Unicode.

ClintGoss · May 2019

Thank you all! ... this has been really helpful ...

I'll sprinkle some characters at the nosebleed end of PUA-B and try it with some common apps on Win7x64 to see what breaks ...

ClintGoss · May 2019

I've done some very basic testing of four "High PUA" characters: U+10ff00 through U+10ff03 on a small flock of apps.
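For anyone wanting to repeat this kind of test, the sketch below builds a string of the same four code points and prints how each one is encoded. It is only an illustration of what the apps have to carry across cut/paste, not the procedure actually used here; the variable names are arbitrary:

```python
# The four "High PUA" code points named above: U+10FF00 through U+10FF03.
test_chars = "".join(chr(cp) for cp in range(0x10FF00, 0x10FF04))

for ch in test_chars:
    utf16 = ch.encode("utf-16-be")
    # Split the UTF-16 bytes into 16-bit code units (a surrogate pair).
    units = [int.from_bytes(utf16[i:i + 2], "big") for i in range(0, len(utf16), 2)]
    utf8 = ch.encode("utf-8")
    print(
        f"U+{ord(ch):X}",
        "UTF-16:", " ".join(f"{u:04X}" for u in units),
        "UTF-8:", " ".join(f"{b:02X}" for b in utf8),
    )
```

Each character needs two UTF-16 code units or four UTF-8 bytes, so an app that fails here is most likely mishandling one of those multi-unit sequences.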

Most of the MS apps (Word, Excel, PowerPoint, WordPad) and Acrobat work as expected and inter-operate nicely. Windows Explorer shows TOFU, but handles cut and paste correctly.

Windows Character Map fails.

Corel Draw works internally, but does not inter-operate (cut/paste) with any other apps.

The Details

The OS is Microsoft Windows 7 Pro 6.1.7601 SP1 x64

All Microsoft ("MS") Office apps are from the suite Office 365 MSO (16.0.11601.20130) 32-bit.

Versions of other applications (the circa dates are based on the copyright notice):

Corel Draw X8 v18.1.0.661 (circa 2016).

Adobe Acrobat 9 Pro version 9.0.0 (circa 2008).

Windows Explorer mostly works.

It shows TOFU for a file name with High PUA characters, but handles cut/paste operations to and from MS Word, correctly preserving the code points.

Windows Character Map fails.

Does not seem to show any characters above U+FFFD.

MS Word works.

Microsoft Word for Office 365 MSO (16.0.11601.20130) 32-bit correctly handles converting High PUA characters using alt-X (e.g. converting 10ff00 followed by alt-X to a U+10ff00 character). It then treats them like other characters. On Save-As/PDF, it writes a .pdf with embedded fonts, which looks OK in Acrobat.

Adobe Acrobat works.

It views the document written by MS Word. Cut from PDF and paste into MS Word works.

Corel Draw fails cut/paste with all MS Office apps, but works internally.

Corel Draw X8 v18.1.0.662

Fails - shows .notdef for High PUA characters pasted from MS Word, but allows selection of High PUA characters from its internal Insert Character docker and treats those characters reasonably thereafter.

MS Excel works.

Microsoft Excel for Office 365 MSO (16.0.11601.20130) 32-bit

Handles cut-paste to/from other MS Office apps.

Fails cut-paste to/from Corel Draw.

MS WordPad works.

Handles cut-paste to/from other MS Office apps.

Saves/restores correctly to/from "Unicode" .txt (writing a BOM and UTF-16).

Saves and restores to/from .rtf format.
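The "Unicode" .txt round trip can be checked outside any particular app with a sketch like the one below. It assumes the file is little-endian UTF-16 preceded by a BOM; the byte order is an assumption, since the notes above only say "a BOM and UTF-16". The temporary file and variable names are illustrative:

```python
import codecs
import os
import tempfile

# Same four High PUA characters used in the tests above.
text = "".join(chr(cp) for cp in range(0x10FF00, 0x10FF04))

# Assumption: BOM followed by little-endian UTF-16.
payload = codecs.BOM_UTF16_LE + text.encode("utf-16-le")

fd, path = tempfile.mkstemp(suffix=".txt")
try:
    with os.fdopen(fd, "wb") as f:
        f.write(payload)
    with open(path, "rb") as f:
        raw = f.read()
finally:
    os.remove(path)

# The "utf-16" codec reads the BOM to pick the byte order, then drops it.
restored = raw.decode("utf-16")
print("round trip ok:", restored == text)
print("bytes on disk:", len(raw))
```

If a saved file decodes back to the same four code points this way, the surrogate pairs survived the save intact.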

MS PowerPoint works.

Handles cut-paste to/from other MS Office apps.

Fails cut-paste to/from Corel Draw.

It also has its own rules for line spacing, but that's another issue ...
