Why do we need two: AGL (or AGLFN) and Unicode?

Comments

  • If we send Adobe to Room 101, what are we left with and will it work everywhere that it needs to?
  • David, I recommend you look at the specification: http://sourceforge.net/adobe/aglfn/wiki/AGL Specification/

    What it does, in essence, is establish a grammar that gets us from a sequence of glyph names to a Unicode character string and back. As Ken Lunde writes: "The mapping is meant to convert a sequence of glyph names to plain text while preserving the underlying semantics. For example, the glyph name for 'A,' the glyph name for 'small capital A,' and the glyph name for a swash variant of 'A' will all be mapped to the same UV (Unicode Value). This is useful in copying text in some environments, and is also useful for doing text searches that will match all glyph names in the original string that mean 'A.'"

    The specification also gives us rules for naming composite glyphs and .alt glyphs that fall outside established Unicode. In liturgical and biblical Hebrew, in which I do a lot of work, there are things missing from (and essentially incorrect in) Unicode that require this treatment, at least for now. Some things of interest only to the most specialized groups probably always will be. I'm not sure these conventions originated with Adobe, but at least they're there to keep track of them.

    So, without something like AGL, we'd be back in Tower of Babel full time, not just for visits, as we are now.
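
    For illustration, a minimal sketch of that name-to-text computation, using the AGL algorithm as implemented in fontTools (my choice of tool; the spec itself is tool-agnostic):

        # pip install fonttools
        from fontTools import agl

        # A glyph stream as it might come out of a distilled PDF: plain, small-cap
        # and swash 'A', an f_i ligature, and a glyph named in uniXXXX notation.
        glyph_stream = ["A", "A.sc", "A.swash", "f_i", "uni05D0"]

        # The AGL algorithm: drop the suffix after the first period, split on
        # underscores, then map each component back to Unicode.
        text = "".join(agl.toUnicode(name) for name in glyph_stream)
        print(text)  # "AAAfiא": all three flavours of 'A' map to the same U+0041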
  • Data redundancy is often used as a mechanism that minimizes data loss. This particular case has to do with how PostScript was originally conceived as a non-archival "one way" format (that you just send to a printer or imagesetter and receive a printout), then received a little brother named PDF which suddenly enjoyed tremendous success as a document archival format, even though it's not really very well engineered to be that.

    Originally, PostScript primarily stored the stream of glyphs in relation to a particular font. The actual semantic value of that glyph stream was of little or no importance. Originally, PDF did that as well, but some mechanisms were later added that allowed a semantic text layer to be stored along with the glyph stream — so that, for example, text search would work inside a PDF.

    Since that was somewhat of an afterthought, a number of mechanisms exist: an entire new data structure that stores the Unicode text strings that accompany the glyph streams, and a "glyph to Unicode" mapping structure which can be used to reconstruct the text stream.

    But those mechanisms are optional, and many PDF creation tools don't embed them -- especially if the PDFs are created through conversion of PostScript, which does not have such mechanisms.

    So, as an additional failsafe mechanism, using "meaningful" glyph names is recommended. Meaningful means: names that a computer can use to reconstruct the Unicode text string for a given glyph stream. And that's where AGLFN and the uniXXXX notation are used. They are the last resort of information that allows text search or text-to-speech to work even if a PDF does not include any of the higher-level structures.
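
    To make that "last resort" concrete, here is a rough sketch of the kind of name parsing a PDF consumer can fall back on (the AGLFN excerpt and the function are mine, for illustration only):

        import re

        # Tiny excerpt of AGLFN, just for illustration.
        AGLFN = {"A": 0x0041, "adieresis": 0x00E4, "germandbls": 0x00DF}

        def name_to_codepoints(name):
            base = name.split(".")[0]                     # drop .alt, .sc, etc.
            out = []
            for part in base.split("_"):                  # ligature names like f_i
                if part in AGLFN:
                    out.append(AGLFN[part])
                elif re.fullmatch(r"uni(?:[0-9A-F]{4})+", part):
                    out += [int(part[i:i + 4], 16) for i in range(3, len(part), 4)]
                elif re.fullmatch(r"u[0-9A-F]{4,6}", part):
                    out.append(int(part[1:], 16))
                # otherwise: no information, which is exactly why meaningful
                # glyph names are recommended.
            return out

        print([hex(cp) for cp in name_to_codepoints("uni00E400DF.alt")])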
  • ScottMartin: "...we'd be back in Tower of Babel full time, not just for visits..."

    Louder!* You visit where I live, i.e. 100 purposes like yours, with all due... = Babel for us. . .*

    Adam:
    "This particular case has to do with how PostScript was originally conceived..."

    That was all great, wasn't it? Fine format!

    [and] "It's the last resort..."

    But that was in a pre-internet age, from when dinosaurs roomed with Bush at Yale and Warnock worked. As type drawers... (sourcing)... we (each) need to have tools that present the (or a) human-readable glyph name stream, and then who cares, let the unicode (+ tables) be the compiled version (font format of choice). I never want to see unicodes... ever, theys a komputyr thing and will never be either complete or eloquent, typographically,* and the redundancy has become deafening.

    So, like that kind of 'we', need one.

  • Thomas Phinney
    Well, if you don't mind that PDFs made from your fonts may not always get the underlying text represented correctly (for searchability, copy/paste, even indexing I suppose), then you can just ignore the issue.

    Also, older versions of Mac OS X ignored the cmap of OTFs and used glyph names instead to determine the encoding. Not sure exactly when they stopped doing that. I am going to ask as I am curious.

    If the benefits of using arbitrary glyph names (or none, for TTF), or not bothering with a final name vs working name distinction seem larger than the costs I mention above, then you should ignore AGL and AGLFN.

    Of course, I am viewing this in pragmatic terms. You may have other reasons for choosing one or the other.
  • John Hudson
    If you are making TTFs, and only care about names during production — i.e. do not care about downstream text reconstruction from distilled PDFs etc. — then you can make your fonts smaller by shipping with a format 3 post table, which contains no names at all. This might be an attractive option for webfonts.

    [I use DTL OTMaster to change to a format 3 post table when I need to. I presume there are other tools one could use. OTMaster is nice because while you are working on a font it preserves the names, so you can switch back and forth between different post table formats without losing data, until you save the changes.]
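
    A minimal sketch of the same change using fontTools instead of OTMaster (my assumption of tooling; any editor that can rewrite the post table will do):

        from fontTools.ttLib import TTFont

        font = TTFont("MyWebfont.ttf")   # hypothetical input file
        font["post"].formatType = 3.0    # format 3 stores no glyph names at all
        # Unlike OTMaster, fontTools simply drops the name data when it compiles
        # a format 3 table, so keep your sources if you want the names back.
        font.save("MyWebfont-post3.ttf")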
  • John Hudson
    PS. You can also toggle the post table format string in OTMaster to refresh the post table, so if you have, for instance, added glyphs to a font using the VTT import glyph function you can use OTMaster to bring the post table length in synch with the new glyph count.
  • Kent Lew
    Well, if you don't mind that PDFs made from your fonts may not always get the underlying text represented correctly (for searchability, copy/paste, even indexing I suppose), then you can just ignore the issue.
    This sidesteps the few knotty situations where Adobe does not maintain certain specific codepoints in your underlying text, regardless.

    The most noticeable example that I’ve investigated is:

    If you type option-j (∆) on a Macintosh keyboard, you will get a U+2206 (increment) codepoint inserted into your text stream. In other words, from the U.S. (and similar) keyboard, one is presumed not to be typing Greek. Seems reasonable.

    Even if you are using a font that has separate Delta and uni2206 glyphs, so-named and encoded 0x0394 and 0x2206 respectively — which, as I read it, is in compliance with the latest AGLFN v1.7 . . . If you type this ∆ in InDesign, format it with said font, and export a PDF, and then open that PDF in Acrobat and copy/paste the text, you will not get back 0x2206, but you will instead get the Greek codepoint 0x0394 in all cases (that I have been able to conceive and test).

    I have checked, and I cannot find any Unicode normalization that dictates that Increment U+2206 be normalized to U+0394 (not in the same way that U+2126 Ohm is unquestionably canonically equivalent now with U+03A9 Omega, for instance). But maybe I’m misreading or misunderstanding the various normalization recommendations?

    I have even tried this test using Quark > exporting to Postscript file > and creating the PDF with Acrobat Distiller (while the font is no longer installed in the system, fwiw) — this being the sort of workflow that I had been led to understand would lead to just such a text reconstruction from glyph names.

    The Acrobat copy/paste result is the same: Greek codepoint.

    Incidentally, copy/pasting the same text using Preview to view the PDF will lead to the same result (Greek codepoint) if the PDF was made directly from InDesign, but will retain/reconstruct the U+2206 increment codepoint if the PDF came from Quark>Postscript. So, who’s right (and who‘s following the AGL-specified computation) — Adobe or Apple?
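
    For what it's worth, the normalization side of this can be checked directly against the Unicode Character Database; a quick sketch in Python:

        import unicodedata

        for cp in (0x2206, 0x0394, 0x2126, 0x03A9):
            ch = chr(cp)
            print("U+%04X %s: decomposition=%s, NFC=U+%04X" % (
                cp,
                unicodedata.name(ch),
                unicodedata.decomposition(ch) or "none",
                ord(unicodedata.normalize("NFC", ch))))

        # U+2126 OHM SIGN has a canonical (singleton) decomposition to U+03A9,
        # so NFC folds it to Omega. U+2206 INCREMENT has no decomposition at all,
        # so nothing in Unicode normalization requires folding it to U+0394;
        # whatever Acrobat does here is not mandated by normalization.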
  • Chris Lozos
    This same Greek-letter glyph vs. math symbol problem has existed for some time now. Is this a keyboarding issue or a compliance issue? Does the person writing the text intend the letter to be Greek or math? If such a text were typed on a Greek-mapped keyboard, the odds are one way; if typed on a Latin keyboard, the odds are the other. How can this be solved so that the author's intent is maintained? Could this be a contextual solution, looking at neighboring glyphs? If so, is the burden of a correct return to intention worth the trouble?
  • Deleted Account
    edited June 2013
    Kent — that's just one glyph (or maybe three). When I grew up, we didn't have increments or products or summations; we gave them Greek names and everybody loved it. I don't even know where those three came from.

    You guys keep sliding into the output names (font format of choice), which is fine. And, when it comes to the output fonts, there is an appropriate glyph identification scheme for each and every use(r), as we decided elsewhere ;)

    To capture all those uses in the mother "(sourcing)" format is the question for me. I'm on my third run at this and I'm delighted at how much easier it is, but I need to wonder something. The first time I did this (glyph name database making), I wanted to give some positioning information in the name, so A acute was not enough and became A acute above, e.g. and every single glyph had this in the name. Now, the process rests soundly on anchor data, and I don't have to go beyond A acute, but I'm e'scared that the mother tongue should have it anyway, because e.g. not everyone will be able to query the anchors to find out where the thing goes.

    And I'm wondering how many glyphs, e.g. are not named explicitly enough in the Adobe glyph names, and so are like a black box to people unlike us. One must know a lot to know, e.g. what accents normally go above, so when they are below the location is included in the name, and vice versa. And a macron is a line so where'd a line accent come from?;) look at this random grab...

    Kcaron;01E8, I know it's above.
    Kcedilla;0136, I know, it's below.
    Kcircle;24C0, I think it's above.
    Kcommaaccent;0136, where is it?
    Kdotbelow;1E32, if you say so.
    Khook;0198, below,
    Klinebelow;1E34, oh?
    Kmonospace;FF2B, ...
    Ksmall;F76B, ...

    And wtf is a "Kmonospace"? What are you callin' "small"? It just goes on and on, imho, because of a desire for specialness I don't have. I want any dumb klutz with one 1/2 good eyeball to feel at home with glyph names, kiss the little unicodes and whatever Adobe apps need goodbye at productization without further thought... and like that, get what's good.

  • Kent Lew
    David — Ah, so you’re talking about maintaining longterm in sources? Got it. DJR and I touched on that the other day.

    But in that case, you probably shouldn’t be looking at the AGL as any touchstone. As most everyone has been saying, that really is about “sliding into output names” (and is superseded by AGLFN).

    If we want something relatively standardized that is decipherable with half an eyeball ;-), then we might be better off working directly from the Unicode names, which tend to be very explicit descriptions containing all the information you seem to want to capture:

    U+24C0 : Circled Latin Capital Letter K
    U+1E34 : Latin Capital Letter K with Line Below

    Of course, these get pretty long to type into spacing strings, etc., so there is a downside.

    This would be about mother source names, then, and leave the translation to output names (according to AGLFN or whatever mapping we want) for the landing pattern, as you said.
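
    A rough sketch of that kind of derivation from the Unicode character names (the lowercase-and-underscore convention here is just one possibility, not a standard):

        import unicodedata

        def source_name(cp):
            # e.g. U+1E34 "LATIN CAPITAL LETTER K WITH LINE BELOW"
            #      -> latin_capital_letter_k_with_line_below
            return unicodedata.name(chr(cp)).lower().replace(" ", "_").replace("-", "_")

        for cp in (0x24C0, 0x1E34, 0x0136):
            print("U+%04X  %s" % (cp, source_name(cp)))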
  • John Hudson
    Some thoughts on source management with regard to diacritics:

    Start including the zero-width combining mark characters as glyphs in your fonts, and use these rather than the spacing accents in composites. The spacing accents — /acute/ /grave/ etc. — are a small subset of accents that exist only for backwards compatibility with old 8-bit character sets. Stick them in the font, somewhere out of the way, but start thinking in terms of combining mark characters as the source for composites as well as GPOS anchor attachment (which means, of course, that you can use your composite positions in pre-built diacritics as a source for GPOS anchor attachment, or vice versa). A quick way to check a font's coverage of the combining marks is sketched at the end of this post.

    David, I don't know if this is of any use to you, but below is a link to the production names and final font names for encoded glyphs in the Brill set. These are only for the glyphs mapped to Unicode values in the cmap. I can send you the complete list, including variants, if you're interested, but this is probably huge enough and some of the other naming is particular to how the Brill font was designed; variant names are easy to derive. Brill cared about accurate Acrobat text reconstruction, so there are unique glyphs for each Unicode and no double-encodings.

    Feel free to make whatever use you like of this:
    http://www.tiro.com/John/BrillNames.txt
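
    A small sketch of the coverage check mentioned above, using fontTools (my choice; any font library that exposes the cmap would do, and the file name is hypothetical):

        from fontTools.ttLib import TTFont

        SPACING_ACCENTS = {0x00B4, 0x0060, 0x02C6, 0x02DC}   # acute, grave, circumflex, tilde
        COMBINING_MARKS = {0x0301, 0x0300, 0x0302, 0x0303}   # their combining counterparts

        font = TTFont("MyFont.otf")
        cmap = font.getBestCmap()     # best available Unicode cmap subtable

        for label, codepoints in (("spacing (legacy)", SPACING_ACCENTS),
                                  ("combining", COMBINING_MARKS)):
            covered = sorted("U+%04X" % uv for uv in codepoints if uv in cmap)
            print("%s: %s" % (label, ", ".join(covered) or "none"))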


  • John Hudson
    And wtf is a "Kmonospace"?
    It's a bad name for a K spaced on an East Asian ideographic full width.
  • Kent Lew
    Which, notably, maps to a PUA. Which is also why it’s hard to interpret.
  • John Hudson
    U+FF2B isn't a PUA codepoint, although one might easily mistake it for one. It is one of the high FXXX codes used for presentation forms, in this case CJK full- and half-width forms [PDF].
  • Am I correct in thinking that, if all the proper GSUB entries are made, then the only compelling reason to retain pre-built composites of letters with accents, instead of using anchor attachments (to zero-width diacritics), is for the purpose of kerning combinations that might crash, such as lowercase "f" followed by ò?
  • John Hudson
    I'm in favour of a text processing approach that would enable us to not only work with decomposed glyph strings for diacritics but would allow us to not include the precomposed character representations in the first place. At the moment, every supported Unicode character in a font has to be mapped to a glyph in the font cmap table. This obliges us to include all the precomposed diacritic characters encoded in Unicode (for backwards compatibility with older standards) as precomposed glyphs, even though these all have canonical decompositions to sequences of base characters and combining marks. Currently, some layout engines perform on-the-fly mappings from decomposed sequences in text to single character codes if they are present in the cmap table (which then need to be decomposed in GSUB if one wants to work with mark positioning instead), but no software seems to do the opposite: check the cmap for canonical base+mark character support to represent precomposed diacritics as decomposed glyph strings.

    A while ago, I proposed a new cmap format on the OT discussion list that would be able to map from Unicode values to sequences of two or more glyphs. So instead of a cmap that maps
    0x00E4 -> /adieresis
    you would have one that maps
    0x00E4 -> /a /dieresiscomb
    There wasn't a lot of support for the idea at the time, although Kemal Mansour at Monotype told me at TypeCon last year that he thought it might get more support if it were limited to Unicode characters with canonical decompositions. But that strikes me as the one place where it isn't really needed, because the canonical decompositions are all standard, so software could apply that sort of mapping without needing a special cmap format.

    I still think it would be worthwhile, and much more interesting in terms of new approaches to font design, to be able to map from Unicode codepoints to arbitrary sequences of glyphs. Decotype, as usual ahead of the curve, has been doing this with their Arabic layout model for many years: they do not need to fill their fonts with hundreds of precomposed Arabic letter glyphs, because they can map from Unicode direct to glyph assemblages.
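
    As a toy illustration of what such a mapping, or an engine-side fallback based on canonical decompositions, could look like (the cmap and glyph names here are invented):

        import unicodedata

        # Hypothetical font: no precomposed /adieresis, only base + combining mark.
        cmap = {0x0061: "a", 0x0308: "dieresiscomb"}

        def glyphs_for(char):
            cp = ord(char)
            if cp in cmap:                                   # direct cmap hit
                return [cmap[cp]]
            decomposed = unicodedata.normalize("NFD", char)  # canonical decomposition
            if len(decomposed) > 1 and all(ord(c) in cmap for c in decomposed):
                return [cmap[ord(c)] for c in decomposed]
            return [".notdef"]

        print(glyphs_for("ä"))   # ['a', 'dieresiscomb'], i.e. 0x00E4 -> /a /dieresiscomb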

    ____

    Issues with relying on GPOS mark positioning for diacritics:

    1. As ScottMartin notes, using decomposed glyph strings introduces issues with kerning or, more generally, with relationships to adjacent typeforms that involve non-adjacent glyphs. As I've ranted about at length elsewhere, especially in the context of Arabic, this is the worst aspect of OpenType Layout architecture: the interaction of different forms of GPOS requiring contextual rules. It is a bottleneck, both in terms of the difficulty of the development work involved and the processing hits during layout. New tools to automate building of the contextual GPOS lookups from a simpler data structure would help resolve the development side, but I'm pretty convinced that the long-term solution requires something like GPOS 2.0 with a more efficient way to represent the data to the layout engine.

    2. There is a bug in InDesign's {ccmp} feature support that causes failure of decomposition of precomposed Unicode diacritics to glyph sequences. We discovered this while testing the Brill fonts (which contextually decompose diacritics when followed by a combining mark; this is something we'd done previously in the Win7 version of Cambria, and which works perfectly with MS layout engines). We alerted Adobe, who confirmed the bug, but were not in time to get it fixed in CS6.
  • "...to automate building of the contextual GPOS lookups from a simpler data structure..."

    This I think is possible...

    "...with a more efficient way to represent the data to the layout engine."

    You mean the ones that exist?

  • John Hudson
    edited June 2013
    I mean more efficient than the OpenType lookup structure. In the latter, lookups are delimited by their context strings, which means that for each different context you need a separate lookup. When you're dealing with situations in which you have interactions between kerning and mark positioning and contexts to control visual interactions of non-adjacent glyphs, things get out of hand very quickly -- or, if you prefer, very slowly, on the processing front.

    The typical Latin script situation is relatively simple -- something like a T followed by a lowercase letter carrying one or more combining mark glyphs above it can be contextually kerned -- but even this can become quite complex if one wants to do a good job and take into account different combinations of lowercase letter width and mark width.

    The situation in naturally complex scripts is far worse. So in the Aldhabi Arabic project that I presented at ISType last summer, we were dealing with interactive mark positioning and kerning taking place within a layout that includes vertical offsets of cursively joining letters. Trigonometric kerning followed by contextual adjustment of mark positions!

    At the moment, I'm working out contextual kerning for a traditional Southeast Asian script in which I need to control the spacing of subscript glyphs that are positioned underneath base glyphs and which might overhang the sidebearings of the base glyph, hence clashing with other subscripts. I've unitised the design to make it as easy as it can be, but it is still very far from easy.

    Having hit this bottleneck numerous times for many different writing systems, I'm convinced it is the worst element of OpenType architecture, and there must be a better way. One thing that comes to mind is additive spacing applied at a level above the individual glyph. For Indic-derived scripts, layout engines are already processing substitutions and positioning at a delimited cluster level, and for Arabic they are already analysing character strings to shape letter groups. What if, instead of relying on glyph-to-glyph sidebearing and kerning relationships, spacing was applied to clusters and letter groups, based on additive values of the visual widths* of the glyphs involved as adjusted by any internal positioning offsets (kerning, cursive attachments, mark attachments)? In a similar fashion, a Latin letter with one or more combining marks would be treated as a cluster and could be spaced as such relative to adjacent glyphs or clusters.

    * The notion of Visual Widths is important in this scheme, because existing layout technologies presume that a significant number of characters will be zero-width, i.e. have no advance width, even though they occupy visual space on the page or screen. Combining marks are expected to be zero-width, and OT Layout engines sometimes enforce this for glyphs identified as marks in the GDEF tables, even though some glyphs that need to be positioned with anchor attachments need subsequent width adjustments (examples of the latter would be Burmese subscript letters, some of which have spacing portions that rise to the right of the base letter). Now, imagine if you could properly position a combining mark above a (dotless) lowercase i and also take into account the visual width of that mark in spacing the resulting cluster relative to adjacent shapes such as the bar of the T. Instead of having to provide contextual kerning for the T in every combination of following lowercase letter and mark, you could simply identify the context (following T) in which you allow the visual width of the following cluster to override the sidebearing spacing of the lowercase base letter.

    That's one idea off the top of my head. I'm sure there are others and quite probably better ones.
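
    A very rough toy of the additive cluster-spacing idea above; all numbers, glyphs and structures are invented, purely to show the arithmetic:

        # Each glyph: advance width plus the horizontal extent of its ink
        # (its "visual width"), in font units.
        GLYPHS = {
            "T":          {"advance": 600, "ink": (40, 560)},
            "idotless":   {"advance": 280, "ink": (80, 200)},
            "brevecomb":  {"advance": 0,   "ink": (-160, 160)},   # zero-width mark
        }

        def cluster_extent(base, marks, mark_offsets):
            """Visual extent of base + attached marks after anchor offsets."""
            lo, hi = GLYPHS[base]["ink"]
            for mark, dx in zip(marks, mark_offsets):
                mlo, mhi = GLYPHS[mark]["ink"]
                lo, hi = min(lo, mlo + dx), max(hi, mhi + dx)
            return lo, hi

        # Cluster: dotless i carrying a breve, anchored 140 units into the base.
        lo, hi = cluster_extent("idotless", ["brevecomb"], [140])

        # In the "following T" context, let the cluster's visual width (plus a
        # margin) override the base letter's sidebearings, instead of kerning T
        # against every lowercase-plus-mark combination separately.
        margin = 30
        advance_after_T = (hi - lo) + 2 * margin
        print(advance_after_T)    # 380, vs. the bare idotless advance of 280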
  • Trigonometric kerning followed by contextual adjustment of mark positions, controlling the spacing of subscript glyphs that are positioned underneath base glyphs and which might overhang the sidebearings of the base glyph, and especially characters with zero-width, make my blood boil. All glyphs deserve space if they want some. I need to watch a hockey game to calm down.

  • John,

    Unfortunately, "post 3" TTFs aren't very reliable in the PDF workflow. I've seen problems with text accessibility (and even rendering!) with Cambria or Calibri in PDFs made from Word 2007 or 2010 and then viewed in Preview on Mac OS X 10.6.
  • John Hudson
    Oh, I'm sure format 3 post table TTFs are not reliable in PDF workflow, and it doesn't surprise me that there might be issues beyond text reconstruction problems. As I tried to suggest, this format is an option if you don't care about downstream Acrobat issues. It might, for example, be attractive for webfonts, since format 3 post tables are so much smaller than format 2 tables. Of course, web-to-PDF workflows exist, but it seems to me easier to get that limited and specific PDF-creation model right than to try to accommodate all the varied and messy ways in which desktop apps create PDFs.
  • Adam Twardoch
    edited June 2013
    John — agreed, with the reservation that one of the important web-to-PDF workflows has been, and will continue to be, the web browser's File / Print function. But I agree that it's still a more controllable environment than "the rest of the wilderness".

    One observation I made when WOFF was created is that, with a new container format, we as a font community might have a chance to clean up some of the historical baggage of the SFNT container (i.e. OTF and TTF).

    Especially with the upcoming WOFF2, I'd be all in favor of sitting down and discussing a set of stricter rules when it comes to "what should be inside".

    While we all must agree that inside the plain SFNT container, i.e. in barebones TTF and OTF, all kinds of dodgy stuff still needs to be supported, and while WOFF1 had to focus on getting the container format working at all, we didn't have many chances to clean up the mess. Google has done the right thing and implemented the "OT sanitizer" as part of Google Chrome, which was, at least to some extent, a step in the right direction.

    Now, I think of WOFF2 not just as a container format, but also, potentially, as an "OpenType profile 2013". Here, we might have a chance to work out stricter recommendations for various tables, and various entries therein.

    For example, I'd *love to* issue a recommendation to completely deprecate the TrueType layout model, i.e. the old "kern" table, inside of WOFF2. I'd love to be able to say "inside WOFF2, modern layout models are highly encouraged, in particular the OpenType Layout model. Other layout models such as AAT or Graphite are also permitted".

    And then issue a recommendation that the "kern" table may only be used with the layout models that still rely on it primarily (which, I guess, AAT does, and perhaps Graphite), but if a WOFF2 font is made for OpenType Layout, then the "kern" table should be absent.

    Even within the OTL tables, we could throw out stuff. I mean, throw out more stuff than the current OFF or OT specs permit.

    While desktop systems may still rely on some heuristics, hooks in the old code to support this or that — with WOFF2, we could "kill" those. cmap 1.0 should be gone. Various name table entries should be gone. Perhaps some device table metrics. Perhaps a requirement on a particular version of particular tables (e.g. version 4 of OS/2). post table? Only format 3 allowed. Things like that.

    WOFF could be for SFNT what PDF/X is for PDF.
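
    A minimal sketch of what such a profile-enforcing pass could look like with fontTools; the specific rules below are just the examples from this thread, not any agreed WOFF2 profile:

        from fontTools.ttLib import TTFont

        def tighten(path_in, path_out):
            font = TTFont(path_in)

            # OpenType Layout fonts: drop the legacy TrueType 'kern' table.
            if "GPOS" in font and "kern" in font:
                del font["kern"]

            # post table: glyph names are dead weight on the web; format 3 only.
            font["post"].formatType = 3.0

            # cmap: keep only Unicode subtables (drops e.g. Macintosh platform
            # and the symbol-encoded 3.0 subtables).
            cmap = font["cmap"]
            cmap.tables = [t for t in cmap.tables
                           if (t.platformID, t.platEncID) in
                              ((3, 1), (3, 10), (0, 3), (0, 4))]

            font.save(path_out)

        tighten("MyFont.ttf", "MyFont-woff2ready.ttf")   # hypothetical file names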
  • Adam Twardoch
    edited June 2013
    PS. For OTL, disallow the "deva" script tag or some feature tags that are clearly outdated. Remove cmap 3.0... etc.

    In short: if we introduce a new container, we should also "sanitize" things which can go into the container. Because it's a significant chance to do that.
  • John Hudson
    Especially with the upcoming WOFF2, I'd be all in favor of sitting down and discussing a set of stricter rules when it comes to "what should be inside".
    Will you be coming to TypeCon? We're planning a W3C Webfonts Working Group meeting on the Thursday (probably afternoon). In the meantime, perhaps you'd like to raise some of these topics on the WG mailing list?

    Instinctively, I'm not sure that sanitising at the format spec level is the right approach if this involves normative MUST and MUST NOT actions from user agents encountering WOFFs that contain 'unsanitary content'. On the other hand, I can see the value to everyone in publishing a recommendation for how to build webfonts that don't contain a huge amount of legacy bloat.
    Doesn’t WOFF 2.0 already have plans to reduce redundant information?

    For PDF creation with a format 3 post table, applications should use the cmap to generate PostScript names; for alternate glyphs or ligatures, some other mechanism that analyzes how they are accessed could be used. FreeType will be doing something like that to figure out how best to autohint.

    For kerning and spacing, the MATH table has an interesting approach; maybe some of its structures could be used or improved for a more flexible GPOS.
  • John Hudson
    For PDF creation with format 3 post table, applications should use cmap to generate postscript names
    If an application creating a PDF has that kind of sophistication, it can simply write the original character strings to the PDF and these issues regarding Acrobat text reconstruction can, in theory at least, be ignored. These issues arise when PDFs are being distilled from print streams that do not contain the original character strings. So there's no path via the cmap table either to map to names or to generate names, because there are no character codes.
    _____

    The MATH table cut-in kerning is clever in that it enables spacing of differentially scaled and vertically offset glyphs. Toolwise, it's currently pretty arduous work to set the cut-ins, but the data model seems okay.