Precomposed fractions — waste of time and space?

Comments

  • John Savard Posts: 1,088
    A hard-coded fi ligature is a disservice to everyone. It renders the text unsearchable and unnormalized. It belongs entirely to the song of the past.

    In a word processor that supports OpenType, just as one can choose which typeface is used for a part of the text, one can also say to use that typeface with the rules for the English language, or with language features turned off. So if one wants fi not to print as a ligature, one can indeed specify it - and if one wants fi to print as a ligature, one can specify that without making changes to the text.

    Since it is desired that PDF files be searchable, they can't be thought of as printer output files either. While a hard-coded fi doesn't belong in any kind of document, sending one to a printer is reasonable, assuming the printer receives Unicode text rather than raster data. However, if the printer handles OpenType fonts, it could do the conversion itself. This raises an interesting question: I'm not sure I want standards to require that output devices have a minimum level of sophistication.
  • A hard-coded fi ligature is a disservice to everyone. It renders the text unsearchable and unnormalized. It belongs entirely to the song of the past.
    The German ß is also a longs_z ligature. It is established as a letter in its own right in the official alphabet. Swiss German does not use it; they use ss instead. Historically, w is also a v_v ligature, j an initial form of i, and v an initial form of u.

    According to Unicode policy, ligatures and combined characters only got their own code points registered if they appeared in legacy encodings. Other combinations, like m + combining overline or a longs_c_h ligature, have no chance of getting a precomposed code point in Unicode.

    Searching is mostly underestimated. Professional search engines like Google support very sophisticated smart matching, but the majority of software developers don't understand the basics of Unicode, and their code is broken. E.g. the name "König" can appear in the wild encoded with a precomposed ö, with o + combining diaeresis, or without the diacritic at all. Good programming languages support accent-insensitive matching. Rarely, but it does appear in the wild, a word can contain a letter from a foreign script, e.g. a Cyrillic е. This can drive you crazy when debugging if you don't consider the possibility. Another known problem in scientific information retrieval is "spelling" variants for Greek letters in names of chemicals, such as German ß for beta.
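
    A minimal sketch of what such matching involves, using Python's unicodedata (the names here are only illustrative):

        import unicodedata

        def fold(s: str) -> str:
            # Decompose (NFD), drop the combining marks, then case-fold, so that
            # "König" (precomposed), "König" (o + U+0308) and "Konig" compare equal.
            decomposed = unicodedata.normalize("NFD", s)
            stripped = "".join(c for c in decomposed if not unicodedata.combining(c))
            return stripped.casefold()

        assert fold("K\u00f6nig") == fold("Ko\u0308nig") == fold("Konig")
        # A stray Cyrillic е (U+0435) still won't match a Latin e, though;
        # that needs a confusables table on top.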
  • Adam Jagosz Posts: 689
    edited January 2020
    I meant that using a hard-coded fi ligature in running text is an error. I know what legacy means as well. Come on now, I'm not suggesting we type vv instead of w and use OpenType to join them. :# Whereas fi is not an established letter in any orthography I've heard of.
  • A hard-coded fi ligature is a disservice to everyone. It renders the text unsearchable and unnormalized.
    Yes and no. I think this is the nub of the issue.

    Let's define a rendering as the combination of a stream of characters (the text) plus a font. The font may have features which change how those characters are displayed, but obviously changing the font doesn't change the text; they're two separate parts of the rendering. So if the font specifies a feature such as a hard fi ligature or a crazy frac feature or whatever (but let's use fi as an example of the general idea), this won't affect your ability to search - it's purely presentational. Your input "text" is just the same as it was, there is no LATIN SMALL LIGATURE FI in it, you can put the cursor between the two characters "f" and "i", you can search for "fi", and it will all work.

    Where things get a bit more complicated is when you want to represent a specific rendering as a file - which you would typically do as a PDF. Once we have a rendering, there is no input text any more, only font glyphs.

    As an experiment, I just created a PDF with the single word "official" and looked at the contents. The glyphs are painted using the TJ operator which is given a list of glyph IDs, like so: [<0050 00ef 0044 004a 0042 004d>]TJ. This is just references to glyphs in the font, not Unicode codepoints. No text.

    But while there's no text, we can reconstruct a text out of this, because another part of the PDF contains a character map which maps these glyph IDs back to Unicode codepoints so that the PDF client can turn the glyph stream into characters which it can then search, copy/paste, etc. And yes, in this map, we have "<00EF> <FB03>", mapping the "ffi" glyph back to the Unicode LATIN SMALL LIGATURE FFI codepoint.

    Does this mean that the text in the rendering is unsearchable? As it happens, no. So long as the character map is present, the PDF client can map the glyphs back to characters, and even though the ligature feature has caused a LATIN SMALL LIGATURE FFI to appear in that character stream, you can still search for "of" and find it. This is because PDF clients know that they need to perform Unicode decomposition when searching - as in fact needs to happen with any client handling Unicode text.
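
    As a quick illustration of that decomposition step (Python here, but any Unicode library behaves the same):

        import unicodedata

        extracted = "o\ufb03cial"  # what comes back from the PDF: o + LATIN SMALL LIGATURE FFI + cial

        # Compatibility decomposition (NFKD) expands U+FB03 back into "ffi",
        # which is why a search for "of" can still find a hit.
        assert unicodedata.normalize("NFKD", "\ufb03") == "ffi"
        assert "of" in unicodedata.normalize("NFKD", extracted)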

    So is the text in the rendering unnormalized? Yes, but does this actually matter? Normalization is just what you do when you're a text processing client and someone hands you a string of Unicode codepoints and you want to do some processing on them.

    The character map which enables this is actually an optional extra to the PDF specification. libtexpdf (which I used to create the PDF) embeds them, and that's what makes the text searchable and allows for some reconstruction of the input text, but it doesn't have to. But this is just a rendering; you shouldn't really have an expectation of being able to reconstruct the original input text stream byte-for-byte, any more than you should be able to reconstruct a recipe from a dish. If you want to do that, use the original document.

    Ah, but what if you do want to save a rendering which preserves the original input text stream byte-for-byte - maybe for archival purposes or something like that? Well, this is where you use PDF/A-1, which adds another text stream: a tree representing the structure of the document, with each structural element (paragraph, heading, etc.) containing an ActualText key mapped to the original input stream.

    So in summary:
    • Your font features don't affect the text stream in the original document.
    • Your font features do affect the way that text stream can be reconstructed out of the rendering, but in almost all cases it doesn't actually matter.
    • If it does matter, there are ways you can save the rendering to avoid that.
  • > This is because PDF clients know that they need to perform Unicode decomposition when searching - as in fact needs to happen with any client handling Unicode text.
    Not with programming IDEs, though. When I'm hand-editing text where combining marks occur, a search for the combining mark doesn't bring up any precomposed codepoints that contain it in their decompositions, and a search for a precomposed diacritic doesn't match its decomposition. Nor would I expect either behavior. In fact, sometimes when I want to copy just the combining mark, I leave Visual Studio Code and rely on Notepad++, which allows that.
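
    That behavior is easy to reproduce: a plain substring search compares raw codepoints, with no normalization in between. A small sketch:

        import unicodedata

        nfc = unicodedata.normalize("NFC", "e\u0301")  # "é" as one precomposed codepoint
        nfd = unicodedata.normalize("NFD", "\u00e9")   # "e" + combining acute

        # A naive editor-style search sees only the raw codepoints, so:
        assert "\u0301" not in nfc  # combining acute not found in the precomposed form
        assert "\u00e9" not in nfd  # precomposed é not found in the decomposed form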
  • Whereas fi is not an established letter in any orthography I've heard of.
    I cannot check all orthographies. But a hardcoded fi appears in texts in the wild, and e.g. Tesseract OCR outputs it. I agree with you that using fi is not good practice, and it will be hard enough to convince the developers of that one piece of software to stop generating ligatures in the text result.
  • The German ß is also a longs_z ligature.
    just for the record: it isn’t.

  • John Savard Posts: 1,088
    The German ß is also a longs_z ligature.
    just for the record: it isn’t.


    No, it's a long s - s ligature. Or at least it looks like one in most Roman fonts, even if Germans no longer think of it that way.
  • Adam Jagosz Posts: 689
    edited January 2020
    Hmm. That's a bit more complex, I think.
    Just because today the replacement for scharfes S is “ss”, and the combination “sz” is its own beast with a different pronunciation, doesn't resolve anything. Orthography, especially in the beginnings of literary languages, was a tricky and unsettled thing, and did not always result in a one-to-one mapping.
    I'm not an expert, but after a quick peek through Wikipedia (en) I can believe that both ligatures existed: sz (ſz) in Blackletter type, and ss (ſs) in Roman type. They were later conflated into a single entity by the influence of the available fonts.
    Just think about it: it's also called Eszett. As Gothic types predated Roman types in German usage, I'd be partial to the “sz” explanation and consider “ss” a later rationalization.
    Fun fact: (a glyph similar to) eszett used to be employed (I don't know how commonly) in Polish as an s_z ligature. (Keyword: Nowy karakter polski. Also check out the super sweet c_z, d_z and r_z ligatures!)
    Lastly, think of all the last names that have SZ pronounced like S: Rachel Weisz etc.
  • it's a long s - s ligature
    just for the record: it isn’t.
    I do not want to hijack this conversation with that matter.
  • The German ß is also a longs_z ligature.
    just for the record: it isn’t.

    Sure? In Blackletter it's clearly longs_z, from Gutenberg onwards into the 20th century. German texts printed in antiqua are very rare before ~1830.

    From Johannis Bellini, Teutsche Orthographie, 1642



    The first, bold face, is Postillen-Schrift (a forgotten typeface), and the other Fraktur. The ß of Alte Schwabacher is in between: the descender of the z is not connected. Some Blackletter typefaces have both longs_z and longs_s printed on the same page.

    See also German and English Wikipedia:

    https://de.wikipedia.org/wiki/%C3%9F
    https://en.wikipedia.org/wiki/%C3%9F

  • The user and all related content has been deleted.
  • Thomas Phinney Posts: 2,732
    edited January 2020
    A whole bunch of precomposed fractions have their own code points. Why not include them?

    Web fonts.

    Some folks want to maximize functionality, but minimize file size. Notably: for web fonts.

    If the same functionality can be achieved without putting the precomposed glyph in the font, it saves space (even if the precomposed fractions use composites). If there is no loss of functionality, leaving unnecessary stuff out is desirable.
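
    For what it's worth, a rough sketch of that kind of subsetting with fontTools (the file names and codepoint range are placeholders):

        from fontTools import subset

        # Keep the arbitrary-fraction machinery (frac/numr/dnom) but only a
        # minimal character set, and compress to WOFF2 for the web.
        options = subset.Options()
        options.layout_features += ["frac", "numr", "dnom"]
        options.flavor = "woff2"

        font = subset.load_font("MyWebFont.ttf", options)  # placeholder name
        subsetter = subset.Subsetter(options)
        subsetter.populate(unicodes=range(0x20, 0x7F))     # ASCII, for the sake of example
        subsetter.subset(font)
        subset.save_font(font, "MyWebFont-subset.woff2", options)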
  • Erwin Denissen Posts: 291
    edited January 2020
    A whole bunch of precomposed fractions have their own code points. Why not include them?

    Web fonts.

    Some folks want to maximize functionality, but minimize file size. Notably: for web fonts.

    If the same functionality can be achieved without putting the precomposed glyph in the font, it saves space (even if the precomposed fractions use composites). If there is no loss of functionality, leaving unnecessary stuff out is desirable.
    Nothing wrong with trying to minimize file size, as long as you include all the characters your customers use. If a web page contains characters that are not in the font, a fallback font will be used to display the missing characters.

    Although not exactly the same issue *, this page illustrates what happens if you don't include the onehalf (½) character:


    It shows the headline in Postoni, except for the onehalf, which uses Georgia.

    *) in this case the font does include the onehalf character, but the style sheet contains a unicode-range CSS descriptor that lacks the onehalf codepoint (U+BD).
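
    You can verify that side of it with fontTools (the file name is hypothetical):

        from fontTools.ttLib import TTFont

        font = TTFont("Postoni-Bold.ttf")  # hypothetical file name
        cmap = font.getBestCmap()
        # True: the glyph is in the font; the unicode-range in the CSS
        # simply never requests it, so the browser falls back to Georgia.
        print(0x00BD in cmap)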


  • As with many things, it depends on the typesetting and text engine.

    It shows the headline in Postoni, except for the onehalf, which uses Georgia.

    I see the whole thing in Postoni (on Mac Chrome). Guessing it has been fixed since you posted the example, though?
  • Erwin Denissen Posts: 291
    edited January 2020
    This is what I see browsing the page with Chrome on Windows:

    The onehalf is in Georgia Bold.
  • And this is how it should look:


  • Thomas Phinney Posts: 2,732
    edited January 2020
    Right. And I see the latter in Chrome 79.0.3945.79 on Mac.
    (I do not have Postoni installed locally.)


  • Ray Larabie Posts: 1,376
     It looks like it's falling back to a different Bodoni.
  • Agree — there are a variety of small differences. Total height, thickness of the fraction bar, length of serifs on the “1” and the bottom left of the “2” are the most obvious.
  • I don’t include the frac feature in my fonts; I find it too unreliable for my taste (is 1/2 the first day of the second month, or one of two?).
    HarfBuzz (and by extension applications using it, like Chrome, Firefox, Android, LibreOffice, etc.) has a neat little feature where the fraction slash (U+2044) triggers the numr feature for the sequence of digits before it, dnom for the digits after it, and frac for the whole sequence. This follows the Unicode recommendation, is unambiguous, and can be used by the user even if the application does not provide feature control.
    So 1⁄2 turns into a fraction (provided the font has the required features) and 1/2 doesn't.
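
    A small demonstration with the uharfbuzz bindings (the font path is a placeholder; any font with numr/dnom/frac will do):

        import uharfbuzz as hb

        blob = hb.Blob.from_file_path("SomeFont.ttf")  # placeholder path
        face = hb.Face(blob)
        font = hb.Font(face)

        def shape(text):
            buf = hb.Buffer()
            buf.add_str(text)
            buf.guess_segment_properties()
            hb.shape(font, buf)  # no features requested explicitly
            return [info.codepoint for info in buf.glyph_infos]  # glyph IDs

        # U+2044 triggers numr/dnom/frac automatically; ASCII "/" does not,
        # so the two strings shape to different glyph sequences.
        print(shape("1\u20442"))
        print(shape("1/2"))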
  • (the buggy font used by this forum has blank glyphs for many characters, so the fraction slash I used above shows blank space instead of letting the browser use a fallback font!).
  • John Hudson Posts: 2,955
    edited January 2020
    I implement the frac feature, but I don't attempt to get clever with the context. I believe the frac feature should only be applied selectively: by the user selecting a specific n/n sequence and applying the feature, through markup interaction, or automatically triggered by the use of U+2044, as per Unicode and the HarfBuzz implementation mentioned by Khaled.
  • In nearly every document I lay out, the frac feature is rarely if ever used with day/month-style dates. I've laid out a lot of such documents since OT features became a thing and can't remember the two being used together.

    If I were to have such a document, I suppose I would insert a zero-width space in the date, which stops the frac feature from triggering on it. dd/mm/yyyy or yyyy/dd/mm dates are not affected by the frac feature in the fonts I've made.



    However, I would (and do) turn on the frac feature globally as long as doing so makes layout expedient, and handle the edge cases as needed.
  • Adam Jagosz Posts: 689
    edited January 2020
    @Khaled Hosny That's super interesting to learn, thanks! This is a neat feature of the browsers. And using the proper fraction character as per Unicode is always commendable.
    I'm seeing it just fine, as the font has been changed recently. You're probably seeing cached CSS.
    I made a small experiment in Firefox and Chrome on Windows, and I was able to disable this default behavior by disabling the frac feature along with either of dnom and numr.
    I'm guessing that's just a side effect of the implementation, though. But the thing is, in an app (for instance a font testing site) that controls font features in the rather gruesome way of gluing together all possible feature tags, instead of using the recommended specific CSS properties (font-variant-numeric etc.), this behavior will be disabled.
  • Erwin Denissen Posts: 291
    edited January 2020
    Ray Larabie said:
     It looks like it's falling back to a different Bodoni.
    The web page contains this:
    .font--headline{font-family:Postona,BodoniSvtyTwoITCTT-Book,georgia,serif;line-height:1.1}
    My Windows system has Georgia and I suspect your Mac has Bodoni. So both show onehalf through a fallback font.

    Bottom line remains; include the characters that your clients want to use.

  • @Khaled Hosny I didn't know about the fraction slash (U+2044) and its special purpose. This would also allow fractions with letters, like x/y, in a semantically safe encoding.

  • Web fonts.

    Some folks want to maximize functionality, but minimize file size. Notably: for web fonts.
    Not only webfonts.

    If the application allows it and size is a problem, a font with a reduced character set can be compiled. The Linux Debian installer did this 15 years ago; maybe they still do. At that time the installer supported ~60 languages, with every piece of text appearing on the screen kept in translation files. So it's easy to know all the characters in use and compile a font automatically with only those characters. Of course it needs all of ASCII, as a user can enter it at installation time.
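
    A sketch of that analysis step (the file layout is invented for illustration):

        from pathlib import Path

        # Collect every character that appears in the translation files...
        chars = set()
        for po in Path("po").glob("*.po"):
            chars |= set(po.read_text(encoding="utf-8"))

        # ...and always keep printable ASCII, since the user can type it.
        chars |= {chr(c) for c in range(0x20, 0x7F)}

        # The resulting list is then fed to a font subsetter
        # (e.g. fontTools' pyftsubset) to compile the reduced font.
        print(f"{len(chars)} characters to keep")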

    A display-only website could do the same thing, but this requires running the analysis and font compilation after each change of the content. I have an online dictionary with ~150 languages; all of them can appear on one page, and in the query form a user can potentially enter any text. In the end I decided against using a webfont and risking unrendered characters in the user's client, as happens on Wikipedia.
  • Nina Stössinger Posts: 151
    edited February 2020
    I presume what you mean is that the substitution points from a glyph that is mapped to one codepoint to a glyph that is mapped to another codepoint. That's discouraged for a couple of reasons
    John, would you mind expanding on what these reasons are? I’ve long been under the ambient/general/vague impression that this is not considered best practice but just realized I don’t really understand why.
  • John Savard Posts: 1,088
    edited February 2020
    I presume what you mean is that the substitution points from a glyph that is mapped to one codepoint to a glyph that is mapped to another codepoint. That's discouraged for a couple of reasons
    John, would you mind expanding on what these reasons are? I’ve long been under the ambient/general/vague impression that this is not considered best practice but just realized I don’t really understand why.
    Come to think of it, I don't understand why either. I understand why it would be very bad for a substitution to point from a codepoint to another codepoint. But glyphs are just marks that go on paper. Pointers go from codepoints to glyphs, and if a substitution points a codepoint from one glyph to another, and that other glyph happens to look like one used with another codepoint, of course the two are merged to save storage.
    So I understand that it's very bad for a font (as opposed to an application) to change ½ into the sequence of characters 1/2. But if the best your font can do is represent ½ as something that looks like 1/2, I don't see that there's anything wrong with using the same glyphs that were used to actually print 1/2.
    In other words, my problem is that I thought that while characters have semantics, glyphs do not have any semantics; they're just geometrical descriptions of areas to be painted black.