Platform and platform encoding identifiers: Getting it straight

I recently did a quick and dirty survey of the cmap tables of all the fonts installed on my computer. Here are the results:

I'm not entirely sure what to draw from this, other than "put (0,3), (1,0) and (3,1) cmap subtables in your fonts, and (3,10) if you need codepoints outside the BMP." Is that a fair statement?
  • Is (0,10) really a thing? I can't find it in the Unicode platform section on the "name" table page. The text on the "cmap" page says very little about platform 0, and should probably say more.
  • There are lots of platform 1 encodings. But I don't see why there should be. The "cmap" table spec says that "When building a font that will be used on the Macintosh, the platform ID should be 1 and the encoding ID should be 0." "Should" is not the same as "must", of course - and more encodings are listed on the name table page, which might be confusing people.
  • Similarly "when building a Unicode font for Windows, the platform ID should be 3 and the encoding ID should be 1." I'm guessing this is legacy language that wasn't updated after the creation of encoding ID 10. The language "Microsoft strongly recommends using Unicode 'cmap' subtables for all fonts" should probably be clarified - I think it means Unicode encoding, not Unicode platform, but since the text follows the description of the Unicode platform and the Unicode encoding has not been introduced at this stage, it's unclear.
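Taken at face value, the rule of thumb I floated above could be sketched like this. The function name `suggested_cmap_subtables` is made up for illustration, and the rule itself is my tentative reading of the survey, not something the spec mandates.

```python
def suggested_cmap_subtables(codepoints):
    """Return (platformID, encodingID) pairs following the tentative rule of thumb above."""
    # Baseline set from the rule of thumb: Unicode, Macintosh Roman, Windows Unicode BMP.
    subtables = [(0, 3), (1, 0), (3, 1)]
    # (3,1) is defined as Unicode BMP only, so anything above U+FFFF needs (3,10).
    if any(cp > 0xFFFF for cp in codepoints):
        subtables.append((3, 10))
    return subtables


# A font covering only "A" versus one that also covers an emoji outside the BMP.
print(suggested_cmap_subtables([0x41]))
print(suggested_cmap_subtables([0x41, 0x1F600]))
```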
The code I used to generate the survey information is available here.
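For anyone who wants to try something similar, here is a minimal sketch of how such a survey could work. It is not the linked script. It assumes plain single-font sfnt files (TrueType or OpenType, not .ttc collections or WOFF), reads only the encoding records in the cmap header, and the names `cmap_encoding_records` and `survey` are invented for this example.

```python
import struct
from collections import Counter


def cmap_encoding_records(data: bytes) -> list:
    """Return the (platformID, encodingID) pairs listed in a font's cmap header."""
    # sfnt header: 4-byte version, uint16 numTables, then three uint16 search fields.
    num_tables = struct.unpack_from(">H", data, 4)[0]
    # The table directory starts at byte 12; each entry is tag, checksum, offset, length.
    for i in range(num_tables):
        tag, _checksum, offset, _length = struct.unpack_from(">4sLLL", data, 12 + 16 * i)
        if tag == b"cmap":
            break
    else:
        return []  # no cmap table found
    # cmap header: uint16 version, uint16 numTables, then 8-byte encoding records.
    _version, num_records = struct.unpack_from(">HH", data, offset)
    records = []
    for i in range(num_records):
        platform_id, encoding_id, _sub_offset = struct.unpack_from(
            ">HHL", data, offset + 4 + 8 * i
        )
        records.append((platform_id, encoding_id))
    return records


def survey(font_blobs) -> Counter:
    """Count how many fonts carry each (platformID, encodingID) pair."""
    counts = Counter()
    for data in font_blobs:
        # Use a set so a pair listed twice in one font is counted once.
        counts.update(set(cmap_encoding_records(data)))
    return counts
```

To run it over a folder you would read each file's bytes (for example with `pathlib.Path.read_bytes`) and pass them to `survey`; which directory holds the installed fonts depends on the platform.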

Comments

  • Simon Cozens
    Other thoughts (taking into account name table):
    • Are Macintosh "script manager" encodings still useful technology? I can't find any reference to these encodings. Shouldn't these just all be deprecated (like the Microsoft script encoding) and people encouraged to use Unicode for everything?
    • Platforms in the name table spec page are listed in the order 0, 3, 1, 2, 4. Is there a sensible reason for this, or someone just trying to mess with my head?
    • Should platform 4 also be deprecated, given it was only there for Win NT support?
    I'm thinking this whole thing can and should be drastically simplified - both the presentation of it and the content.

  • Simon, you make some good points. As you've pointed out, from this point forward we mostly need (3,1) and (3,10) cmaps. Your analysis shows that we have "cmap debris" that has accumulated over two decades. It helps to understand that during the long, slow transition from code pages and Mac character sets to Unicode, it was necessary to load fonts with various cmaps to enable access to characters in various environments. Similarly, CJK fonts were often equipped with cmaps for national character sets, as well as Unicode. As for the MacRoman cmap, it's become a habit/tradition to include it, but I don't think its absence will lead to failure. Since Kozuka Gothic is a fairly recent product that includes upper-plane Japanese characters supporting JIS X 0213, it uses a (3,10) cmap. I agree that the OT spec needs to be updated to reflect real needs. Over the past 5 years or so, everyone has been so occupied with the additions for variable fonts that little capacity is left over for other issues.
  • Simon Cozens
    Yes, I understand the historical baggage (I well remember having to convert text between JIS, SJIS and EUC) - but it's time to clear it up, at least for new font producers.

    I'm surprised you suggest going for (3,1) and (3,10), though - and now that I think about it, I'm not sure why (3,1) and (3,10) are even there. The idea that we need a platform-specific "Windows" implementation of Unicode encodings (and we should prefer that over a platform-neutral encoding) seems to go against the whole point of what Unicode is about.
  • No intention to emphasize the 3 in (3,1) and (3,10). It's become the de facto Unicode platform, but of course we should define and use a platform-neutral entry instead.