I recently did a quick and dirty survey of the cmap tables of all the fonts installed on my computer. Here are the results:
I'm not entirely sure what to draw from this, other than "put (0,3), (1,0) and (3,1) tables in your fonts, and (3,10) if you need codepoints outside the BMP." Is that a fair statement?
- Is (0,10) really a thing? I can't find it in the Unicode platform section on the "name" table page. The text on the "cmap" page says very little about platform 0, and should probably say more.
- There are lots of platform 1 encodings. But I don't see why there should be. The "cmap" table spec says that "When building a font that will be used on the Macintosh, the platform ID should be 1 and the encoding ID should be 0." "Should" is not the same as "must", of course - and more encodings are listed on the name table page, which might be confusing people.
- Similarly "when building a Unicode font for Windows, the platform ID should be 3 and the encoding ID should be 1." I'm guessing this is legacy language that wasn't updated after the creation of encoding ID 10. The language "Microsoft strongly recommends using Unicode 'cmap' subtables for all fonts" should probably be clarified - I think it means Unicode encoding, not Unicode platform, but since the text follows the description of the Unicode platform and the Unicode encoding has not been introduced at this stage, it's unclear.
The code I used to generate the survey information is available here.
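For anyone who wants to try a similar survey, here is a minimal sketch of one way to do it. It is not the script linked above. It uses only the Python standard library, reads the (platform ID, encoding ID) pairs straight from the 'cmap' header, and skips font collections (`ttcf`) for simplicity. The function names `cmap_encoding_records` and `survey` are my own, chosen for illustration.

```python
import struct
from collections import Counter


def cmap_encoding_records(font_bytes):
    """Return the (platformID, encodingID) pairs of a single-font sfnt file."""
    # Font collections start with 'ttcf' and need extra handling; skip them here.
    if font_bytes[:4] == b"ttcf":
        return []
    # sfnt header: sfntVersion (4 bytes), numTables (uint16), three more uint16s.
    num_tables = struct.unpack_from(">H", font_bytes, 4)[0]
    cmap_offset = None
    for i in range(num_tables):
        # Each table directory record is 16 bytes: tag, checksum, offset, length.
        tag, _checksum, offset, _length = struct.unpack_from(
            ">4sIII", font_bytes, 12 + 16 * i
        )
        if tag == b"cmap":
            cmap_offset = offset
            break
    if cmap_offset is None:
        return []
    # cmap header: version (uint16), numTables (uint16), then 8-byte records.
    _version, num_records = struct.unpack_from(">HH", font_bytes, cmap_offset)
    pairs = []
    for i in range(num_records):
        platform_id, encoding_id, _subtable_offset = struct.unpack_from(
            ">HHI", font_bytes, cmap_offset + 4 + 8 * i
        )
        pairs.append((platform_id, encoding_id))
    return pairs


def survey(paths):
    """Count how many of the given font files carry each (platform, encoding) pair."""
    counts = Counter()
    for path in paths:
        with open(path, "rb") as f:
            counts.update(set(cmap_encoding_records(f.read())))
    return counts
```

Calling `survey()` with a list of `.ttf` or `.otf` paths from your own font directory returns a `Counter` keyed by (platform ID, encoding ID).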
Comments
- Are Macintosh "script manager" encodings still useful technology? I can't find any reference to these encodings. Shouldn't these just all be deprecated (like the Microsoft script encoding) and people encouraged to use Unicode for everything?
- Platforms in the name table spec page are listed in the order 0, 3, 1, 2, 4. Is there a sensible reason for this, or is someone just trying to mess with my head?
- Should platform 4 also be deprecated, given it was only there for Win NT support?
I'm thinking this whole thing can and should be drastically simplified - both the presentation of it and the content.
I'm surprised you suggest going for (3,1) and (3,10), though - and now that I think about it, I'm not sure why (3,1) and (3,10) are even there. The idea that we need a platform-specific "Windows" implementation of Unicode encodings (and we should prefer that over a platform-neutral encoding) seems to go against the whole point of what Unicode is about.
That said, most of the size of the cmap table is in the actual subtables; the encoding records are only 8 bytes. A (3,1) record and a (0,3) record can certainly point to the same format 4 subtable. Likewise, (0,4) and (3,10) records can point to the same format 12 subtable.
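To make the sharing concrete, here is a rough sketch of a 'cmap' table layout in which several encoding records reuse the same subtable offset. The helper `build_cmap` is hypothetical, and the subtable bytes are placeholders rather than valid format 4 or format 12 data. Only the header layout (version, numTables, then 8-byte records of platform ID, encoding ID, and offset) is the point here.

```python
import struct


def build_cmap(records, subtables):
    """Assemble a cmap table.

    records:   list of (platformID, encodingID, subtable_key)
    subtables: dict mapping subtable_key -> already-encoded subtable bytes
    """
    # Encoding records are sorted by platform ID, then encoding ID.
    records = sorted(records)
    header_len = 4 + 8 * len(records)  # 4-byte header + 8 bytes per record

    # Lay each subtable out once and remember where it starts.
    offsets = {}
    body = b""
    for key, data in subtables.items():
        offsets[key] = header_len + len(body)
        body += data

    out = struct.pack(">HH", 0, len(records))  # version 0, numTables
    for platform_id, encoding_id, key in records:
        # Several records may carry the same offset.
        out += struct.pack(">HHI", platform_id, encoding_id, offsets[key])
    return out + body


# Placeholder payloads standing in for real format 4 / format 12 subtables.
bmp_subtable = b"\x00" * 10
full_subtable = b"\x00" * 12

table = build_cmap(
    [(0, 3, "bmp"), (3, 1, "bmp"), (0, 4, "full"), (3, 10, "full")],
    {"bmp": bmp_subtable, "full": full_subtable},
)
```

In this layout the four records cost 32 bytes in total, while each subtable is stored only once.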