u+00af and u+02c9

James Puckett · July 2018

Why do people put u+00af (macron) and u+02c9 (modifier letter macron) in fonts? Why is u+02c9 appearing in fonts but not other modifier letters? Is this just a mistake that one person made and lots of other people have copied?

Thomas Phinney · July 2018

These days, I generally put in both combining diacritics (modifier letter whatsitsname) and standalone diacritics (whatsitsname). It isn't really any extra work, and at least for basic accents, they are part of some standard character sets, basically for legacy reasons.

But I wouldn't be wagging my finger very hard at somebody who left out the standalone diacritics.

Are you seeing this specific to the macron, though? With the same designer leaving out the other modifier letters they have standalone accents for? That is... odd.

I can also imagine somebody retrofitting an existing font that has precomposed accents for western European characters, adding CE support and doing that through combining accents. But in such a case, I would expect to see the combining accents for other CE characters, such as the Hungarian double-acute-style umlaut (hungarumlaut).

James Puckett · July 2018

What’s the difference between the combining marks and the modifier letter marks? Are the modifier marks supposed to be zero width so they appear atop letters?

I asked this because the modifier letter macron is in Underware’s Latin Plus, which means it’s turning up in new fonts based on Latin fonts. Given how many people think their work supports the Oneipot language, I assume that modifier letter macron is showing up in lots of new fonts.

John Hudson · July 2018

Modifier letter marks are not zero-width: these are spacing signs that are used in a variety of ways in phonetic transcription.

Combining marks should generally be zero-width, and may need mark-to-base and mark-to-mark anchor attachments to be properly useful.

Unicode contains a number of duplicate spacing accent characters, which were inherited from previous standards and disunified because they don't always behave exactly the same (the relative height of modifier letter marks is important for some transcription work, so they don't always align with generic spacing accents). Hence the dual codepoints U+00AF and U+02C9, which are both spacing characters and in some fonts may be mapped to the same glyph. Note, however, that in the Unicode charts U+00AF, the inherited ANSI 'MACRON' character, is shown as a longer overline, while U+02C9 is a spacing version of the combining macron, used in some phonetic notation, as I recall, as a high tone indicator.

Craig Eliason · July 2018

Related question: is there utility to including both spacing and non-spacing versions of Greek (e.g. tonos) and Cyrillic (e.g. Cyrillic breve) diacritics in fonts?

Denis Moyogo Jacquerye · July 2018

U+00AF MACRON (= overline, APL overbar) is really an overline. Most fonts include it because it’s in common legacy encodings like Windows CodePage 1252 or Mac Roman. It’s mostly meant to be used as the U+005F LOW LINE (= underscore) but above. These legacy encodings either used it as macron / overline above letters or as a spacing overline or even both, just like they used U+005F as macron below / underline or as a spacing low line or even both. When alone, it usually has a wider shape than the macrons used on top of letters and may even connect when following or preceding itself.

Edit: Apparently it’s called “high minus” in APL.
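The legacy-encoding point can be checked with Python's built-in codecs. This is only a small sketch: `legacy_byte` is a helper name made up for illustration, and `cp1252` and `mac_roman` are Python's codec names for Windows code page 1252 and Mac Roman.

```python
# Look up the macron characters in two legacy 8-bit encodings.
MACRON = "\u00af"            # U+00AF MACRON
MODIFIER_MACRON = "\u02c9"   # U+02C9 MODIFIER LETTER MACRON


def legacy_byte(char, encoding):
    """Return the encoded byte for char, or None if the encoding lacks it."""
    try:
        return char.encode(encoding)
    except UnicodeEncodeError:
        return None


for encoding in ("cp1252", "mac_roman"):
    for char in (MACRON, MODIFIER_MACRON):
        print(f"U+{ord(char):04X} in {encoding}: {legacy_byte(char, encoding)}")
```

U+00AF has a slot in both encodings, while U+02C9 has none in either, which fits the idea that U+00AF is in most fonts for legacy reasons.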

U+02C9 MODIFIER LETTER MACRON was encoded to represent the spacing diacritic used for tone marking. It usually has the same shape as the macron used on top of letters. Many fonts use it as a component for letters with macron.

U+0304 COMBINING MACRON is a non-spacing diacritic; it should be used as the macron on top of letters (whether they have composed characters encoded or are composed by the shaping engine with OT features).
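For anyone who wants to see how Unicode itself classifies the three characters, here is a minimal sketch using Python's standard `unicodedata` module. The `describe` helper is made up for illustration, and the values reflect whatever Unicode version ships with your Python.

```python
import unicodedata

# The three "macron" characters discussed in this thread.
CHARS = ["\u00af", "\u02c9", "\u0304"]


def describe(char):
    """Collect the Unicode name, general category and combining class."""
    return {
        "name": unicodedata.name(char),
        "category": unicodedata.category(char),           # Sk, Lm, Mn, ...
        "combining_class": unicodedata.combining(char),   # 0 for spacing characters
    }


for char in CHARS:
    print(f"U+{ord(char):04X}", describe(char))
```

Only U+0304 is a non-spacing mark (category Mn, combining class 230). U+00AF is a modifier symbol (Sk) and U+02C9 a modifier letter (Lm), and both are spacing characters.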

Max Phillips · July 2018

Combining marks should generally be zero-width […]

@john hudson What happens if they're not? And does this apply to legacy standalone diacritics, too?

Vasil Stanev · July 2018

Max Phillips said:

Combining marks should generally be zero-width […]
@john hudson What happens if they're not? And does this apply to legacy standalone diacritics, too?

I also would like to know.

John Hudson · July 2018

What happens if they're not?

Ah now, the answer to that depends on the platform, the specific character and script, whether the glyphs are categorised as marks in the GDEF table, and probably other factors that escape me at the moment. The most important consideration in this respect is the GDEF categorisation, because at least some layout engines — notably Microsoft's Uniscribe — will tend to enforce a zero-width on any glyph defined as a mark in GDEF* (excepting in a font with a monospace flag set, in which case you need to collapse the width of marks in a GPOS lookup prior to positioning). So my recommendation is that any time you categorise a glyph as a mark in GDEF you should ensure that it is zero-width, because better you control this than run into situations where some shaping engines zero the width and some don't.

*Not sure whether this is the case for all the shaping engines in Uniscribe, and whether it is true for Latin script.

[Note that there are situations in some complex scripts where there are post-positioning spacing signs that you need to categorise as marks in GDEF so that you can put them into GSUB mark filter sets, so they are skipped in ligation between glyphs on either side of the sign. In that case, one spaces them in the glyf/CFF table as zero-width, since they're nominally marks, and then uses GPOS to add width to them.]

And does this apply to legacy standalone diacritics, too?

No. The legacy standalone diacritics are spacing characters. If you look at e.g. U+0060 GRAVE ACCENT in Unicode, you'll see it is annotated 'this is a spacing character'. Some of the legacy spacing accents also have compatibility decompositions to the space character + combining diacritic, e.g. U+00B8 CEDILLA ≈ 0020 0327.
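The compatibility decompositions are easy to inspect with Python's standard `unicodedata` module. This is just a sketch: `compat_sequence` is a helper name invented here, and the list of characters is a small hand-picked sample rather than a complete inventory.

```python
import unicodedata

# A hand-picked sample of legacy spacing accents.
LEGACY_SPACING_ACCENTS = ["\u0060", "\u00a8", "\u00af", "\u00b4", "\u00b8"]


def compat_sequence(char):
    """Return the NFKD (compatibility) form of char as 'U+XXXX' strings."""
    return [f"U+{ord(c):04X}" for c in unicodedata.normalize("NFKD", char)]


for char in LEGACY_SPACING_ACCENTS:
    print(f"U+{ord(char):04X} {unicodedata.name(char)}: {compat_sequence(char)}")
```

U+0060 comes back unchanged because it has no decomposition, while the Latin-1 accents come back as a space followed by the matching combining mark.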

Michel Boyer · July 2018

I always took for granted that I could rely on the FontForge metrics window to see how the glyphs are expected to be positioned. Here the combining diacritic is U+0302, the font is Source Sans Pro, mark is activated, and this is the 2012 version of FontForge. All that is changed in the clip is the width of uni0302.

Was I mistaken?

Michel Boyer · July 2018

[Video clip: the FontForge metrics window while the width of uni0302 is changed]

Michel Boyer · July 2018

I just checked with TextEdit on the Macintosh with uni0302 quite wide, and the width had no effect on the rendering.

John Hudson · July 2018

I just checked with TextEdit on the Macintosh with uni0302 quite wide, and the width had no effect on the rendering.

I'm not sure what you mean here. If the /uni0302/ glyph is as you have it in the FontForge screenshot, then it is zero-width, not 'quite wide'.

If you made a version in which you gave the /uni0302/ glyph a non-zero advance width, and you still get the same positioning and spacing of adjacent letters in TextEdit as when the glyph was zero-width, that is an indication that Apple's text engine is zeroing the width of the glyph because it recognises it as a mark (presumably because it is being identified as such in the GDEF table).

Michel Boyer · July 2018

John Hudson said:

I just checked with TextEdit on the Macintosh with uni0302 quite wide, and the width had no effect on the rendering.
I'm not sure what you mean here. [...]

If you made a version in which you gave the /uni0302/ glyph an advance width of non-zero, [...]

Yes that is exactly what I did.

Khaled Hosny · July 2018

AFAIK, HarfBuzz and Uniscribe enforce zero-width for mark glyphs (either based on GDEF glyph classes or Unicode character properties, depends on the script IIRC), but Core Text does not enforce it, unless they changed this behavior.

Thomas Phinney · July 2018

Michel Boyer said:

I always took for granted that I could rely on the FontForge metrics window to see how the glyphs are expected to be positioned. Here the combining diacritic is U+0302, the font is Source Sans Pro, mark is activated, and this is the 2012 version of FontForge. All that is changed in the clip is the width of uni0302.

Was I mistaken?

You were mistaken.

More clearly: as discussed in the thread, the results can be engine-dependent. If you build things “properly,” you will generally get the same result across all engines.

But if you do wacky things like give a non-zero advance width to a character Unicode defines as a zero-width combining mark, the results will vary. Some engines will quite reasonably ignore the advance width. Others may respect it.
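One way to avoid relying on engine behaviour is to check a font for GDEF marks that still carry an advance width. The following is a rough sketch, not a definitive tool: `nonzero_width_marks` and `check_font` are names made up here, the toy glyph names and widths are invented, and `check_font` assumes the third-party fontTools package and its usual table layout (`GlyphClassDef.classDefs`, `hmtx.metrics`).

```python
MARK_CLASS = 3  # GDEF glyph class value that the OpenType spec assigns to marks


def nonzero_width_marks(glyph_classes, advances):
    """Return sorted names of glyphs classed as marks whose advance is not zero."""
    return sorted(
        name
        for name, glyph_class in glyph_classes.items()
        if glyph_class == MARK_CLASS and advances.get(name, 0) != 0
    )


def check_font(path):
    """Run the check on a font file (assumes fontTools is installed)."""
    from fontTools.ttLib import TTFont  # imported lazily; the toy data needs no font

    font = TTFont(path)
    if "GDEF" not in font or font["GDEF"].table.GlyphClassDef is None:
        return []  # no glyph classes defined, so nothing is classed as a mark
    glyph_classes = font["GDEF"].table.GlyphClassDef.classDefs
    advances = {name: adv for name, (adv, _lsb) in font["hmtx"].metrics.items()}
    return nonzero_width_marks(glyph_classes, advances)


# Toy data standing in for a real font; names and widths are made up.
toy_classes = {"a": 1, "uni0302": 3, "uni0304": 3}
toy_advances = {"a": 520, "uni0302": 300, "uni0304": 0}
print(nonzero_width_marks(toy_classes, toy_advances))
```

Any glyph this reports is one where engines that zero mark widths and engines that don't could give different spacing.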

John Hudson · July 2018

If you actually want a nominal mark glyph to have an advance width, it is fairly reliable to do this via GPOS. This is something I do quite frequently in complex script fonts, especially South Indian scripts where it's necessary to kern off some marks, and hence it is easiest to be able to first give the marks a consistent left or right sidebearing amount. And as mentioned previously, there are some properly spacing signs that need to be classified as marks in order to be skipped in GSUB, and which then need to have their widths added back.

Michel Boyer · July 2018

Thomas

According to Unicode Technical Note #2, here is the way to get the bounding box of a character with combining diacritical marks:

   combination_bounding_rect = base_bounding_rect;
   display the base glyph at (0,0);
   while (more marks) {
       display the mark relative to combination_bounding_rect;
       increase combination_bounding_rect by
           the extent of mark_bounding_rect;
   }
   move horizontally by the width of the base glyph;

That code assumes the diacritic may have a non-zero advance, and the examples given are with standard combining marks for Latin. Where do you find in the standard that the advance should be zero?

By the way, if I understand correctly, what FontForge does not do correctly is the last line, i.e. move horizontally by the width of the base glyph.

[1 hour after...] The code simply uses the bounding rectangle of the diacritic, not its advance (character width). The question concerning the width remains though, especially if my understanding of the last line is correct.
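To make that reading of the pseudocode concrete, here is a small Python sketch of the same loop. It is an interpretation, not the Technical Note's own code: the `Rect` type, the `place_above` rule (center each mark over the combination and stack it a fixed gap above) and the numbers are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Rect:
    x_min: float
    y_min: float
    x_max: float
    y_max: float

    def union(self, other):
        """Smallest rectangle containing both rectangles."""
        return Rect(min(self.x_min, other.x_min), min(self.y_min, other.y_min),
                    max(self.x_max, other.x_max), max(self.y_max, other.y_max))


def place_above(mark, combo, gap):
    """Assumed placement rule: center the mark over combo, gap units above it."""
    dx = (combo.x_min + combo.x_max) / 2 - (mark.x_min + mark.x_max) / 2
    dy = combo.y_max + gap - mark.y_min
    return Rect(mark.x_min + dx, mark.y_min + dy, mark.x_max + dx, mark.y_max + dy)


def layout_cluster(base_rect, base_advance, mark_rects, gap=50):
    """Return the combined bounding box and the pen movement for the cluster."""
    combo = base_rect
    for mark in mark_rects:
        combo = combo.union(place_above(mark, combo, gap))
    # Only the base advance moves the pen; mark advances are never consulted.
    return combo, base_advance


base = Rect(50, 0, 450, 500)
mark = Rect(-150, 550, 150, 650)   # ink bounds only; its advance plays no role
print(layout_cluster(base, 500, [mark, mark]))
```

In this sketch only the ink bounds of the marks enter the calculation, so whatever advance width the mark glyph has makes no difference to the result.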

Michel Boyer · July 2018

John Hudson said:

[...] And as mentioned previously, there are some properly spacing signs that need to be classified as marks in order to be skipped in GSUB, and which then need to have their widths added back.

John, where is that documented?

John Hudson · July 2018

John, where is that documented?

It isn't, so far as I know. It's in the 'things I needed to figure out' category. This is a strategy for making fonts, not a requirement of any standard. The circumstances in which one uses this method depends on the script, the design, and the approach taken to representing a particular combination of glyphs within a cluster.

OpenType Layout lacks a move operator, i.e. there is no way in GSUB to explicitly move a glyph from one position in the glyph string to another. That means that if one wants to ligate two glyphs that are not adjacent in the glyph string, one needs to find some other method to get them next to each other for purposes of the ligature lookup.

One method is to use a two-step contextual substitution to insert a duplicate of one of the glyphs in a desired location and then remove the initial instance of the glyph. This only works, though, so long as the complete context is perfectly and unambiguously definable.

The other method is to skip the intervening glyphs, which is only possible by classifying them as marks. Since shaping engines may or may not zero the width of marks, this obliges one to set the advance width of such glyphs to zero at the glyf/CFF level, and then to use GPOS to manage the width after GSUB has been performed.

A good example of this is in Telugu, where vowel signs that need to ligate with the base consonant in a cluster (green) might be separated from that glyph by a postscript form of a second consonant (red). That postscript glyph is not handled as a mark for positioning, since it is a spacing sign, but needs to be classified as a mark in GDEF so that it can be put into an appropriate mark filtering set for the ligature lookup.

Image: https://us.v-cdn.net/5019405/uploads/editor/se/dw0rcegllec5.png
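The skip-then-restore-width approach can be modelled in a few lines of Python. This is a toy model only, not how any shaping engine is implemented: the glyph names (`base`, `cons2.post`, `vowelsign`, `base_vowelsign`), the widths and both function names are hypothetical.

```python
def apply_ligature(glyphs, components, ligature, skip):
    """Ligate components into ligature, stepping over glyphs listed in skip.

    Skipped glyphs found between the components are kept and end up
    after the ligature glyph.
    """
    out, i = [], 0
    while i < len(glyphs):
        j, skipped, matched = i, [], True
        for n, component in enumerate(components):
            # After the first component, step over glyphs the lookup ignores.
            while n > 0 and j < len(glyphs) and glyphs[j] in skip:
                skipped.append(glyphs[j])
                j += 1
            if j < len(glyphs) and glyphs[j] == component:
                j += 1
            else:
                matched = False
                break
        if matched:
            out.append(ligature)
            out.extend(skipped)
            i = j
        else:
            out.append(glyphs[i])
            i += 1
    return out


def advances_after_gpos(glyphs, font_advances, gpos_extra):
    """Advance per glyph: the width stored in the font plus any GPOS adjustment."""
    return [font_advances[g] + gpos_extra.get(g, 0) for g in glyphs]


run = ["base", "cons2.post", "vowelsign"]
shaped = apply_ligature(run, ("base", "vowelsign"), "base_vowelsign", {"cons2.post"})
print(shaped)
# The spacing sign is stored at zero width and gets its width back in "GPOS".
print(advances_after_gpos(shaped, {"base_vowelsign": 900, "cons2.post": 0},
                          {"cons2.post": 400}))
```

With an empty skip set the ligature never forms, because the intervening glyph breaks the match. That is the situation the mark classification works around.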

There has been some talk over the years about introducing new mechanisms that would make this sort of stuff unnecessary. One idea I had was to make it possible to filter arbitrary glyphs, not just marks. Another, proposed by Martin Hosken, was to add an explicit move operator to OpenType Layout. The latter is probably the better idea, but either would involve a major overhaul of OTL at both spec and implementation levels, with all the attendant issues around staggered support on different platforms, and there doesn't seem to be a lot of enthusiasm for such disruption given that we have workarounds that get the job done with existing support.

Claudio Piccinini · March 2023

Denis Moyogo Jacquerye said:

U+00AF MACRON (= overline, APL overbar) is really an overline. Most fonts include it because it’s in common legacy encodings like Windows CodePage 1252 or Mac Roman. It’s mostly meant to be used as the U+005F LOW LINE (= underscore) but above. These legacy encodings either used it as macron / overline above letters or as a spacing overline or even both, just like they used U+005F as macron below / underline or as a spacing low line or even both. When alone, it usually has a wider shape than the macrons used on top of letters and may even connect when following or preceding itself.
Edit: Apparently it’s called “high minus” in APL.

U+02C9 MODIFIER LETTER MACRON was encoded to represent the spacing diacritic used for tone marking. It usually has the same shape as the macron used on top of letters. Many fonts use it as a component for letters with macron.

U+0304 COMBINING MACRON is a non-spacing diacritic, it should be used as the macron on top of letters (whether they have composed characters encoded or are composed by the shaping engine with OT features).

I’m reviving this old discussion because I am a bit confused by Denis’s statement above, as far as U+00AF is concerned.
I usually use U+0304 to combine with letters, but as a rule I used to duplicate it in U+00AF to have both zero-width and non-zero-width macrons.
But according to Denis that would be a character for a different use, and since it is part of the range 0080 C1 Controls and Latin-1 Supplement, I was wondering whether it would be correct to design it as an overbar (mirroring the advance width of the underline) and leave the spacing equivalent of U+0304 under U+02C9, as rightly pointed out.

John Hudson · March 2023

I’ve made fonts in which I treat U+00AF as a spacing macron, which is a common convention, and fonts in which I treat it as a spacing overline (same width as the underline _ ), which is more accurate according to Unicode but less common. I think the prevalence of the macron implementation is because Adobe decided to assign U+00AF the glyph name /macron.

Effectively, I don’t think it much matters which approach you use, because this is a legacy character that doesn’t get used a lot.

Claudio Piccinini · March 2023

John Hudson said:

I’ve made fonts in which I treat U+00AF as a spacing macron, which is a common convention, and fonts in which I treat it as a spacing overline (same width as the underline _ ), which is more accurate according to Unicode but less common. I think the prevalence of the macron implementation is because Adobe decided to assign U+00AF the glyph name /macron.

Effectively, I don’t think it much matters which approach you use, because this is a legacy character that doesn’t get used a lot.

Thanks John. I used U+0304 and U+02C9, which is more correct. The fact that U+00AF is called "macron" is confusing. So, since it has to be included as it’s part of many default encodings, do you think I should make it a duplicate of U+02C9 and treat it as a spacing macron? At this point, since I can’t see much use in having a duplicate of U+02C9, I was considering making it a spacing overline.

John Hudson · March 2023

So, since it has to be included as it’s part of many default encodings, do you think I should make it a duplicate of U+02C9 and treat it as a spacing macron? At this point, since I can’t see much use in having a duplicate of U+02C9, I was considering making it a spacing overline.

Either of those options is fine for most users and most situations.

Denis Moyogo Jacquerye · March 2023

U+00AF is ambiguous in Unicode and can have various forms (overline, spacing macron or even non-spacing overline or non-spacing macron in some legacy encodings), whereas U+02C9 is strictly a spacing macron and U+0304 strictly a combining macron.

Claudio Piccinini · March 2023

Denis Moyogo Jacquerye said:

U+00AF is ambiguous in Unicode and can have various forms (overline, spacing macron or even non-spacing overline or non-spacing macron in some legacy encodings), whereas U+02C9 is strictly a spacing macron and U+0304 strictly a combining macron.

Yes, I think now it has been clearly shown, so one just has to decide what to do with that ambiguity and pick one of the options. But Unicode should make it less equivocal; maybe we should notify them? How is that usually done?

John Hudson · March 2023

Unicode should make it less equivocal

There are a few characters in Unicode whose ambiguity is the result of legacy encodings and use. The ambiguity is part of the identity of these characters.

Another well known one is U+002D, which is both hyphen and minus. See also the concurrent discussion about the accepted ambiguous use of U+05B9 as applied to the Hebrew letter vav.

Unicode typically provides disambiguating encodings alongside the ambiguous characters, rather than making the identity of the latter less equivocal.

Claudio Piccinini · March 2023

Yes, for some it is understandable. But where supposed usage comes from a legacy, shouldn’t this be specified? Today I learned a bit about the lozenge, which seems to come unequivocally from adding machines, where it was used alongside the hashtag/octothorpe version of the number sign. That was useful, as it ruled out its possible function as a mathematical operator (those have specific codepoints).
So I was able to decide, since I am not interested in incorporating a comprehensive math symbol set but I find the lozenge quite useful as a pointer or graphic sign.

John Hudson · March 2023

where supposed usage comes from a legacy, shouldn’t this be specified?

Well, it sort of is:

Image: https://us.v-cdn.net/5019405/uploads/editor/9s/49ezjg12oed3.png

So here we see

that this character is the legacy APL overbar;
that it is an overline but that it has different spacing behaviour from the U+203E overline;
that it has a non-canonical equivalence to space+combining macron.

That all lets you know that this is an ambiguous character that could be graphically treated in at least a couple of different ways, depending on the kind of font one is making and its intended purpose.

Claudio Piccinini · March 2023

John Hudson said:

where supposed usage comes from a legacy, shouldn’t this be specified?
Well, it sort of is:

So here we see
that this character is the legacy APL overbar;
that it is an overline but that it has different spacing behaviour from the U+203E overline;
that it has a non-canonical equivalence to space+combining macron.
That all lets you know that this is an ambiguous character that could be graphically treated in at least a couple of different ways, depending on the kind of font one is making and its intended purpose.

Thanks! Is this information also in the PDFs Unicode publishes for the specific ranges? I’ll have to check.

John Hudson · March 2023

Yes, this is from the code chart PDF:
https://www.unicode.org/charts/PDF/U0080.pdf
