Platform and platform encoding identifiers: Getting it straight

I recently did a quick and dirty survey of the cmap tables of all the fonts installed on my computer. Here are the results:

I'm not entirely sure what to draw from this, other than "put (0,3), (1,0) and (3,1) tables in your fonts, and (3,10) if you need codepoints outside the BMP." Is that a fair statement?
  •  Is (0,10) really a thing? I can't find it in the Unicode platform section on the "name" table page. The text on the "cmap" page says very little about platform 0, and should probably say more.
  • There are lots of platform 1 encodings. But I don't see why there should be. The "cmap" table spec says that "When building a font that will be used on the Macintosh, the platform ID should be 1 and the encoding ID should be 0." "Should" is not the same as "must", of course - and more encodings are listed on the name table page, which might be confusing people.
  • Similarly "when building a Unicode font for Windows, the platform ID should be 3 and the encoding ID should be 1." I'm guessing this is legacy language that wasn't updated after the creation of encoding ID 10. The language "Microsoft strongly recommends using Unicode 'cmap' subtables for all fonts" should probably be clarified - I think it means Unicode encoding, not Unicode platform, but since the text follows the description of the Unicode platform and the Unicode encoding has not been introduced at this stage, it's unclear.
The code I used to generate the survey information is available here.
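
For readers who want to run a similar survey, here is a minimal sketch using only the Python standard library. It is not the code linked above. The function names `cmap_encoding_records` and `survey` are hypothetical, and the sketch assumes plain single-font sfnt files (no TTC collections, no WOFF compression).

```python
import struct
from collections import Counter

def cmap_encoding_records(data: bytes):
    """Return the (platformID, encodingID) pairs listed in a font's cmap table."""
    # sfnt header: sfntVersion (4 bytes), numTables (uint16), then three more uint16s.
    num_tables = struct.unpack_from(">H", data, 4)[0]
    for i in range(num_tables):
        # Each table record is 16 bytes: tag, checksum, offset, length.
        tag, _checksum, offset, _length = struct.unpack_from(">4sIII", data, 12 + 16 * i)
        if tag == b"cmap":
            break
    else:
        raise ValueError("no cmap table found")
    # cmap header: version (uint16), numTables (uint16).
    _version, num_records = struct.unpack_from(">HH", data, offset)
    pairs = []
    for i in range(num_records):
        # Each encoding record is 8 bytes: platformID, encodingID, subtable offset.
        platform_id, encoding_id, _sub = struct.unpack_from(">HHI", data, offset + 4 + 8 * i)
        pairs.append((platform_id, encoding_id))
    return pairs

def survey(paths):
    """Count how many of the given font files carry each (platformID, encodingID) pair."""
    counts = Counter()
    for path in paths:
        with open(path, "rb") as f:
            counts.update(set(cmap_encoding_records(f.read())))
    return counts
```

Calling `survey()` on a list of font paths gives a `Counter` keyed by pairs such as `(3, 1)`, which is roughly the shape of the results discussed here.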

Comments

  • Simon Cozens
    Other thoughts (taking into account name table):
    • Are Macintosh "script manager" encodings still useful technology? I can't find any reference to these encodings. Shouldn't these just all be deprecated (like the Microsoft script encoding) and people encouraged to use Unicode for everything?
    • Platforms in the name table spec page are listed in the order 0, 3, 1, 2, 4. Is there a sensible reason for this, or is someone just trying to mess with my head?
    • Should platform 4 also be deprecated, given it was only there for Win NT support?
    I'm thinking this whole thing can and should be drastically simplified - both the presentation of it and the content.

  • Simon, you make some good points. As you've pointed out, from this point forward we mostly need (3,1) and (3,10) cmaps. Your analysis shows that we have "cmap debris" that has accumulated over two decades. It helps to understand that during the long, slow transition from code pages and Mac character sets to Unicode, it was necessary to load fonts with various cmaps to enable access to characters in various environments. Similarly, CJK fonts were often equipped with cmaps for national character sets, as well as Unicode. As for the MacRoman cmap, it's become a habit/tradition to include it, but I don't think its absence will lead to failure. Since Kozuka Gothic is a fairly recent product that includes upper-plane Japanese characters supporting JIS X 0213, it uses a (3,10) cmap. I agree that the OT spec needs to be updated to reflect real needs. Over the past 5 years or so, everyone has been so occupied with the additions for variable fonts that little capacity is left over for other issues.
  • Simon Cozens
    Yes, I understand the historical baggage (I well remember having to convert text between JIS, SJIS and EUC) - but it's time to clear it up, at least for new font producers.

    I'm surprised you suggest going for (3,1) and (3,10), though - and now that I think about it, I'm not sure why (3,1) and (3,10) are even there. The idea that we need a platform-specific "Windows" implementation of Unicode encodings (and we should prefer that over a platform-neutral encoding) seems to go against the whole point of what Unicode is about.
  • No intention to emphasize the 3 in (3,1) and (3,10). It's become the de facto Unicode platform, but of course we should define and use a platform-neutral entry instead.
  • I'm not entirely sure what to draw from this, other than "put (0,3), (1,0) and (3,1) tables in your fonts, and (3,10) if you need codepoints outside the BMP." Is that a fair statement?
    Not entirely. https://developer.apple.com/fonts/TrueType-Reference-Manual/RM06/Chap6cmap.html gives the search order for tables related to your suggested scheme as (0,4), (0,3), (3,10), (3,1), (3,0). It deprecates the Macintosh platform and encoding scheme! I suggest normally reducing the needed set to (0,4) or (0,3); (3,10) if supplementary characters are supported; and (3,1). Additional tables may be needed for variation sequences; I fear there may be issues with Mongolian variation sequences - I'm not sure how uniformly format 14 will work for them. Older fonts implemented them by ligature substitutions.
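
To make the suggestion concrete, here is a rough sketch of the two ideas in this comment. The search order is the one quoted above and has not been re-checked against Apple's page. The function names are hypothetical, and pairing (0,4) with supplementary-plane fonts and (0,3) with BMP-only fonts is one reading of "(0,4) or (0,3)", not something the comment states outright.

```python
# Search order as quoted in the comment above (assumption: quoted accurately).
SEARCH_ORDER = [(0, 4), (0, 3), (3, 10), (3, 1), (3, 0)]

def pick_subtable(available):
    """Return the first (platformID, encodingID) from SEARCH_ORDER that the font has, or None."""
    present = set(available)
    for pair in SEARCH_ORDER:
        if pair in present:
            return pair
    return None

def suggested_records(code_points):
    """Suggest encoding records for a new font, following the reduced set proposed above."""
    has_supplementary = any(cp > 0xFFFF for cp in code_points)
    if has_supplementary:
        # Code points beyond the BMP need the full-repertoire records.
        return [(0, 4), (3, 1), (3, 10)]
    return [(0, 3), (3, 1)]
```

A font that only carries (1,0) returns `None` from `pick_subtable`, which is the situation the deprecation remark is warning about.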

    It seems to me that (0,3) should have format 4, though coverage will occasionally justify using other formats. Format 4 has the advantage that the table can be shared with (3,1). Format 4 has the disadvantage that if idDelta[] is non-zero, then iOS at least would misread it. (I was flabbergasted to see such a bug on an iPhone 6 when testing a Sinhala font back in June.)
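
For reference, this is roughly how idDelta is meant to be applied in a format 4 lookup. It is a simplified sketch working on already-parsed segments, with a hypothetical function name, and it only covers segments whose idRangeOffset is 0. It says nothing about what the iOS reader actually did.

```python
def format4_glyph_id(code, segments):
    """Look up a BMP code point in parsed format 4 segments.

    `segments` is a list of (start_code, end_code, id_delta) tuples sorted by
    end_code, ending with the mandatory 0xFFFF segment. Only the
    idRangeOffset == 0 case is handled here.
    """
    for start_code, end_code, id_delta in segments:
        if code <= end_code:
            if code < start_code:
                return 0  # code falls in a gap between segments: missing glyph
            # idDelta is added to the character code modulo 65536.
            return (code + id_delta) % 65536
    return 0

# Hypothetical segments: A-Z mapped to glyphs 3..28, plus the final 0xFFFF segment.
EXAMPLE_SEGMENTS = [(0x41, 0x5A, -62), (0xFFFF, 0xFFFF, 1)]
```

With these example segments, "A" (0x41) maps to glyph 3 because 65 - 62 = 3, so a reader that ignores a non-zero idDelta would land on the wrong glyph.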

    I am not at all sure how a font for Egyptian Hieroglyphics would use a (1,x) mapping. I've always been worried that adding a (1,0) encoding could cause it to fail for characters with no mapping for the Macintosh platform.



  • As with Kamal, no intention of emphasizing the 3 in (3,1) and (3,10), but we can have high confidence that those are the most widely supported. I think there's no need for any others.

    That said, most of the size of the cmap table is in the actual subtables; the encoding records are only 8 bytes. A (3,1) record and a (0,3) record can certainly point to the same format 4 subtable. Likewise, (0,4) and (3,10) records can point to the same format 12 subtable.
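
A small sketch of what that sharing looks like at the byte level. The helper name `build_cmap_header` and the 40-byte length of the format 4 subtable are made up for illustration; the subtables themselves are not built here.

```python
import struct

def build_cmap_header(records):
    """Pack a cmap header followed by its encoding records.

    `records` maps (platformID, encodingID) to a subtable offset measured
    from the start of the cmap table. Several keys may share one offset.
    """
    out = struct.pack(">HH", 0, len(records))  # version 0, numTables
    # Records are written sorted by platformID, then encodingID.
    for (platform_id, encoding_id), offset in sorted(records.items()):
        out += struct.pack(">HHI", platform_id, encoding_id, offset)  # 8 bytes each
    return out

HEADER_SIZE = 4 + 8 * 4          # header plus four 8-byte encoding records
FMT4_OFFSET = HEADER_SIZE        # format 4 subtable starts right after the records
FMT12_OFFSET = FMT4_OFFSET + 40  # hypothetical: assumes a 40-byte format 4 subtable

header = build_cmap_header({
    (0, 3): FMT4_OFFSET,    # Unicode BMP   -> shared format 4 subtable
    (3, 1): FMT4_OFFSET,    # Windows BMP   -> same bytes
    (0, 4): FMT12_OFFSET,   # Unicode full  -> shared format 12 subtable
    (3, 10): FMT12_OFFSET,  # Windows full  -> same bytes
})
```

Four encoding records cost 32 bytes in total here, while the subtable data is stored only twice.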