Separate language codes for different Englishes

Nick Shinn · January 2018

I would like to code fonts to behave differently for North American and British English.
(In particular, for punctuation.)
Is this possible?

Mike Wenzloff · January 2018

When I made a font primarily to be used in some journals that quoted later 18th, early 19th century French, I did proper French punctuation in a stylistic set. I was doing the layout, though, and did it to speed up both the import and any keyed corrections. I don't know if I would do so in a released font.

But there isn't any reason I couldn't move or copy the chaining contexts into a locl. Maybe. But I think that would mess up French users keying or laying out their work.

John Hudson · January 2018

Examples of the kind of things you're wanting to handle in this way, Nick?

Mike, are you talking just about the conventional French spacing of punctuation, or e.g. representing raised quote mark characters with guillemet glyphs?

Mike Wenzloff · January 2018

John,

All the above and more.

The text came to me keyed in by an English-speaking person using English conventions. So basically the font did the substitutions of both primes and typographical quotes, the spacing, etc. It also dealt with a set of medial /s forms according to what I could ferret out for early 19th century French rules...it at least matched the manuscripts being quoted.

Nick Shinn · January 2018

Examples of the kind of things you're wanting to handle in this way, Nick?

I would like to replace quotesingle with a right quote mark for North America, to remedy the “smartquote” fail that generates ‘18, rock ‘n’ roll, etc.

(Still can’t understand why there aren’t separate Unicode points for right quote and apostrophe.)

Also, for some typefaces, flipped left quotes.

Hrant Հրանդ Փափազեան Papazian · January 2018

There is a separate Unicode for apostrophe; software is just still stuck on QWERTY overloading.
Previously: http://typedrawers.com/discussion/comment/24738/#Comment_24738

André G. Isaak · January 2018

There is a separate Unicode for apostrophe

U+02BC is named ‘apostrophe’, but this character is classified as a modifier letter rather than as punctuation. I think this was intended more for words like ’alif or O’odham where it represents a glottal stop. (Similarly U+02BC should be reserved for words like Hawai‘i).

André

Kent Lew · January 2018

André — I think you mistyped. I believe you meant U+02BB for Hawaiʻian okina in your last sentence.

joeclark · January 2018

Examples of the kind of things you're wanting to handle in this way, Nick?

The only material differences between en-CA/en-US (one set) and every other form (also one set):

- Periods and commas inside (en-CA/en-US) or outside quotation marks
  - Flowchart required for utterances vs. other quoted text†
- Double (en-CA/en-US) vs. single at outset

The issue I have daggered above requires human intervention in every case, hence could not be automated even if you wanted to.

Are you also going to deal with adding thin spaces between adjoining quotation marks? How about British style in a quotation whose first word begins with an apostrophe?

Further, in this comment I have chosen to use hyphen instead of nonbreaking hyphen. This could be argued.
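
For anyone who wants to see the two sets spelled out, here is a minimal Python sketch that sorts BCP 47 language tags such as en-US or en-GB into the two conventions listed above. The function names, the dictionary layout, and the decision to put a bare "en" tag in the second set are assumptions made for the example, not part of any standard. The punctuation placement is recorded only as a default, since the daggered case still needs a human decision.

```python
# Sketch only: names and data layout are illustrative assumptions.

# Regions that follow the North American convention described in this post.
NORTH_AMERICAN_REGIONS = {"US", "CA"}

NORTH_AMERICAN = {
    "outer": ("\u201C", "\u201D"),  # double quotes at outset
    "inner": ("\u2018", "\u2019"),  # single quotes when nested
    "punctuation_inside": True,     # default only; utterances need a human
}
OTHER_ENGLISH = {
    "outer": ("\u2018", "\u2019"),  # single quotes at outset
    "inner": ("\u201C", "\u201D"),  # double quotes when nested
    "punctuation_inside": False,
}


def english_convention(tag):
    """Return the quoting convention for an English BCP 47 tag like 'en-CA'."""
    subtags = tag.replace("_", "-").split("-")
    if subtags[0].lower() != "en":
        raise ValueError(f"not an English language tag: {tag!r}")
    # Simplification: treat the first two-letter subtag after the language
    # as the region (this skips a four-letter script subtag such as 'Latn').
    region = next(
        (s.upper() for s in subtags[1:] if len(s) == 2 and s.isalpha()), None
    )
    if region in NORTH_AMERICAN_REGIONS:
        return NORTH_AMERICAN
    # Assumption: a bare 'en' or any other region falls into the second set.
    return OTHER_ENGLISH


def quote(text, tag, nested=False):
    """Wrap text in the outer (or nested) quotation marks for the given tag."""
    opening, closing = english_convention(tag)["inner" if nested else "outer"]
    return f"{opening}{text}{closing}"
```

Calling `quote("hello", "en-US")` wraps the word in double marks, while `quote("hello", "en-GB")` uses single marks. The distinction lives in the language tag attached to the text, which is information a layout application has to supply.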

André G. Isaak · January 2018

André — I think you mistyped.

Indeed I did! Good catch.

André

Hrant Հրանդ Փափազեան Papazian · January 2018

The ʻokina is for Hawaiian, but we should make an effort to reclaim U+02BC for proper use.

Nick Shinn · January 2018

Quote mark standards, opening and nested

„Afrikaans, Dutch, Polish”
‚Afrikaans, Dutch, Polish’

„Bulgarian, Czech, German, Icelandic, Lithuanian, Slovak, Serbian, Romanian“
‚Bulgarian, Czech, German, Icelandic, Lithuanian, Slovak, Serbian, Romanian‘

»Danish, Croatian«
›Danish, Croatian‹

«Greek, Spanish, Albanian, Switzerland, Turkish»
‹Greek, Spanish, Albanian, Switzerland, Turkish›

‘British’
“British”

“American English, Irish, Portuguese”
‘American English, Irish, Portuguese’

”Finnish, Swedish”
’Finnish, Swedish’

«French»
“French” or ‹French›

«Norwegian»
‘Norwegian’

***

My premise is that <quoteleft> is almost never used in North America, except in nested quotes in body text. However, in display, it appears in error with massive frequency, in lieu of the apostrophe, courtesy of “Smart Quote” algorithms.

Therefore, I would like to replace the <quoteleft> glyph with one of apostrophic shape. But that wouldn’t work in the UK, so I would like to treat them differently—but the <locl> tag doesn’t differentiate Englishes.

Perhaps I should just go ahead, and make different fonts for North America and elsewhere, clearly labelled.

Joe:

The issue I have daggered above requires human intervention in every case, hence could not be automated even if you wanted to.

But it has been automated, the aforementioned “Smart Quote” algorithm. That was well-intentioned, to appease typographers who love their curly quotes, but with an unfortunate side effect.

John Hudson · January 2018

Don't try to solve character-level problems at the glyph level.

Nick Shinn · January 2018

In general, but in this case that is exactly what “Smart Quotes” attempts to do, so I see no reason not to try and bug-fix that, with a less fail-prone kludge, if such a thing is possible, which is what I started this thread to find out—on the assumption that language coding might be useful.

I’m also not convinced that reversed left quote marks are legitimate Unicode characters, they strike me more as an alternate glyph form that may be typeface-specific; for instance, in certain historic usages such as movie title cards. And of course many ATF typefaces of the early 20th century in which they were the norm—a particularly American style.

notdef · January 2018

calt scan everything for glyph sequence "colour" and/or "collywobbles" and toss that left-leaning commie bastard overboard with the tea

Kent Lew · January 2018

In general, but in this case that is exactly what “Smart Quotes” attempts to do, so I see no reason not to try and bug-fix that

No, “Smart Quotes” attempts to solve the character-level problems at the character level, not the glyph/font level. It just doesn’t do it well in all cases.

What you need is a more complete (or competent) Smart Quotes algorithm that is coded in such a way as to be easily integrated into a range of text input environments and on various platforms.
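
To make that concrete, here is a minimal Python sketch of a slightly more careful straight-to-curly conversion. It is not any shipping Smart Quotes implementation. The function name, the tiny elision list, and the rule that a quote mark before digits is an apostrophe are all assumptions, chosen only to cover the ’18 and rock ’n’ roll cases raised earlier in the thread.

```python
import re

APOSTROPHE = "\u2019"   # right single quotation mark, doubling as apostrophe
LEFT_SINGLE = "\u2018"
LEFT_DOUBLE, RIGHT_DOUBLE = "\u201C", "\u201D"

# Hypothetical, deliberately tiny list of words that begin with an elision.
LEADING_ELISIONS = ("tis", "twas", "em", "cause")


def smarten(text):
    """Convert straight quotes to curly ones, guarding leading apostrophes."""
    # 1. A free-standing 'n' (rock 'n' roll): both marks are apostrophes.
    text = re.sub(r"(?<!\w)'n'(?!\w)", APOSTROPHE + "n" + APOSTROPHE, text)
    # 2. A quote before digits ('18) or a listed elision ('tis) is an apostrophe.
    words = "|".join(LEADING_ELISIONS)
    text = re.sub(
        rf"(?<!\w)'(?=\d|(?:{words})\b)", APOSTROPHE, text, flags=re.IGNORECASE
    )
    # 3. Any remaining quote at the start of a word opens a quotation.
    text = re.sub(r"(?<!\w)'(?=\w)", LEFT_SINGLE, text)
    text = re.sub(r'(?<!\w)"(?=[\w\u2018\u2019])', LEFT_DOUBLE, text)
    # 4. Whatever is left is a closing mark or a word-internal apostrophe.
    return text.replace("'", APOSTROPHE).replace('"', RIGHT_DOUBLE)
```

The sketch still guesses wrong on a genuine quotation that opens with a digit or with a word from the elision list, which is the kind of case where grammar or dictionary knowledge would be needed.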

Hrant Հրանդ Փափազեան Papazian · January 2018

John Hudson said:

Don't try to solve character-level problems at the glyph level.

When the barriers are too great, hack. AKA jugaad. Under-represented typography knows this well, like how Armenian became available to computer users many years before Unicode.

John Hudson · January 2018

When the barriers are too great, hack. AKA jugaad. Under-represented typography knows this well, like how Armenian became available to computer users many years before Unicode.

Lots of scripts had varieties of standard, pseudo-standard, and non-standard character encoding schemes pre-Unicode. So? Mixing character space and glyph space was a bad idea then as now, as Adobe expert set fonts demonstrated.

Hrant Հրանդ Փափազեան Papazian · January 2018

Yeah it was really bad Armenians didn't just twiddle their thumbs.
https://en.wikipedia.org/wiki/Jugaad

John Savard · January 2018

Hrant H. Papazian said:

Yeah it was really bad Armenians didn't just twiddle their thumbs.

No, that's not the point. Of course if doing things in the future-proof standards-compliant way is not an option, it's better to hack than to do without.

But when you have the choice, it's better to do things the right way, and people do have that choice now, or at least, this is what he is claiming.

Jasper de Waard · January 2018

Nick Shinn said:

Quote mark standards, opening and nested

„Afrikaans, Dutch, Polish”
‚Afrikaans, Dutch, Polish’

„Bulgarian, Czech, German, Icelandic, Lithuanian, Slovak, Serbian, Romanian“
‚Bulgarian, Czech, German, Icelandic, Lithuanian, Slovak, Serbian, Romanian‘

»Danish, Croatian«
›Danish, Croatian‹

«Greek, Spanish, Albanian, Switzerland, Turkish»
‹Greek, Spanish, Albanian, Switzerland, Turkish›

‘British’
“British”

“American English, Irish, Portuguese”
‘American English, Irish, Portuguese’

”Finnish, Swedish”
’Finnish, Swedish’

«French»
“French” or ‹French›

«Norwegian»
‘Norwegian’
***

I would classify the use of quotes in the above scheme in Dutch as old fashioned, most current Dutch media would opt for the 'English' version, and both are correct. My point: these things are subject to fashion, so perhaps ill-suited to code into a typeface.

John Hudson · January 2018

Hrant, you seem to be wilfully ignoring the point, which is not about hacking vs twiddling thumbs, but about where to hack. Armenian text processing pre-Unicode involved using the same 8-bit codes as ANSI and assigning them to Armenian characters, just as was done for dozens of other writing systems. Sometimes that was done within the framework of official standards (e.g. the ISCII encoding standards in India), sometimes it was done in an ad hoc way by specific communities (where community might be a user group of a particular computer platform), and sometimes it was done on a font-by-font basis. Obviously, the more standardised the encoding, the better the chances of text interchange and platform interoperability, and it's no wonder that the standardised 8-bit encodings tended to become the model for migration to 16-bit codepages, which allowed for fairly easy migration of fonts and documents too. So even when hacking a character encoding solution, there were clearly better and worse ways to do it, better and worse places to apply the hack.

Nick is describing a limitation within the algorithm that converts some character codes to different character codes in certain circumstances. The algorithms don't always produce the correct result (because the circumstances are more complex than the algorithms allow for: there are exceptions to simple rules that the algorithms don't anticipate). So that's the problem in need of a solution. Where should that solution reside?

It seems obvious to me that the font-specific glyph processing level is not a very good place at which to try to solve that problem. It isn't an interoperable solution (in order to be interoperable the same hack would have to be made in all fonts), and it masks rather than fixes the problem, because it leaves the incorrect, unwanted character in the text string.

I'm not opposed to a well-considered hack*, but I think the tendency of font makers to try to solve text processing problems in glyph space is that of the proverbial man with a hammer who sees everything as a nail.

_____

* cf. the custom normalisation schema that I developed with Biblical Hebrew scholars and text processing experts to bypass uncorrectable errors in the Unicode canonical combining class assignments for Hebrew marks [SBL Hebrew User Manual: Appendix B, p.21]. That's a necessary hack. I was pleasantly surprised recently to discover that it's been adopted as an ad hoc standard in a range of software that needs to handle Hebrew text.

Hrant Հրանդ Փափազեան Papazian · January 2018

Except it doesn't seem that what Nick wants has an officially sanctioned solution. Not unrelated to how the apostrophe has been crippled by bureaucrats with muddled intentions.

Things like the Adobe Expert Sets only seem a bad idea in retrospect; in fact they helped people feasibly implement better typography in their time.

The purity some people seek can become a crutch.

@Jasper de Waard Typefaces are even more subject to fashion than typesetting conventions. Plus they're easy to modify.

Also subject to fashion: "standards".

Nick Shinn · January 2018

“Smart Quotes” attempts to solve the character-level problems at the character level, not the glyph/font level. It just doesn’t do it well in all cases.

True Kent, “Smart Quotes” isn’t a font-level hack, but it does the same thing that we’re not supposed to do in fonts, which is to replace one character by another (not written in the text) that has a glyph deemed more appropriate by a third party (not the document’s author or typographer).

What you need is a more complete (or competent) Smart Quotes algorithm that is coded in such a way as to be easily integrated into a range of text input environments and on various platforms.

Yes, it should be upgraded to utilize grammar and dictionary intelligence, to work better. And that would also require some kind of language specificity that distinguishes Englishes—to address the issues noted by Joe.

Simon Cozens · January 2018

Let’s talk about expectations. If you’re typesetting a text, would you expect the quotation marks to change direction when you change font?

Typesetters and designers who care about these things can change the character to get the result they want. Providing a weird and unexpected experience for everyone else doesn’t seem to justify any potential benefit.

Hrant Հրանդ Փափազեան Papazian · January 2018

You choose the typeface based on what it can do for your text, not merely for following lowest-common-denominator expectations. And you generally don't switch a typeface unless you chose the one at hand poorly, which means the new one should be given the benefit of the doubt.

Would you expect the ampersand to change from its conventional shape to an "Et"? Does that make the "Et" form necessarily bad? And what if the numerals go from lining to OS?

In the end, if a typeface designer believes that a convention is dysfunctional, following it anyway can become an act of hypocrisy.

John Hudson · January 2018

In the end, if a typeface designer believes that a convention is dysfunctional, following it anyway can become an act of hypocrisy.

Or compromise. 'Purity' cuts both ways.

On the digital text processing vs typography front, I've come to the conclusion that getting the right characters in the string is the sine qua non of accurately representing those characters; ergo, if the wrong character is in the string, the appropriate solution is to replace it with the correct character, not to try to represent the incorrect character with the correct glyph. Not sure why that seems disagreeable to anyone.

I ordered red wine, not white.
I'm sorry sir, just let me add some red food colouring to that glass.
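
A small Python sketch of the character-level repair argued for here, assuming the text has already been through a Smart Quotes pass and now holds a left single quotation mark (U+2018) where an apostrophe was meant. The function name and the two patterns are illustrative assumptions that cover only the examples from this thread, not a general solution.

```python
import re

LEFT_SINGLE = "\u2018"   # what a naive Smart Quotes pass inserted
APOSTROPHE = "\u2019"    # what the author meant


def repair_apostrophes(text):
    """Replace a wrongly inserted U+2018 with U+2019 in two obvious cases."""
    # Case 1: abbreviated years and other leading digits, e.g. U+2018 + "18".
    text = re.sub(rf"{LEFT_SINGLE}(?=\d)", APOSTROPHE, text)
    # Case 2: the elided "and" in phrases such as rock 'n' roll.
    text = re.sub(
        rf"{LEFT_SINGLE}n{APOSTROPHE}(?!\w)",
        APOSTROPHE + "n" + APOSTROPHE,
        text,
    )
    return text


wrong = "rock \u2018n\u2019 roll, class of \u201818"
fixed = repair_apostrophes(wrong)

# Drawing U+2018 with an apostrophe-shaped glyph changes nothing here:
# the stored string still fails a search for the correct spelling.
assert "\u2019n\u2019" not in wrong
assert "\u2019n\u2019" in fixed
```

The repaired string behaves the same in every font, and in search, spell-checking, and copy and paste. A genuine quotation that opens with a digit would be changed by mistake, so a real tool would need more context than these two patterns.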

Hrant Հրանդ Փափազեան Papazian · January 2018

To me the sine qua non is what the user sees, not how we made it so.
And nothing lasts forever.

Simon Cozens · January 2018

This whole discussion reminds me of my favourite bit of OpenType code,

sub period space space by period space

If you think fonts should be opinionated about linguistic conventions, you should probably include that one too.

John Hudson · January 2018

[Signing out, since Hrant has once again reduced himself to communicating in slogans. No dialogue to be had here.]

Nick Shinn · January 2018

This discussion, like many today, pits those who believe in correct principles vs. those who believe in correct outcomes, both as best practices.

However, this thread was started to address the practicalities of two very specific situations, namely “Smart Quotes” and reversed left quote marks.

I wondered if one way to deal with these might involve making a distinction between American and other Englishes.

And I’m getting a lot of flak from the principled purists, of whom I would ask, do you have a better idea? Certainly it’s true that as Kent says, “What you need is [a better algorithm]”, but I already know that ain’t gonna happen, and it’s not something I can come up with—but a font hack is.

I was interested to know if there is, technically, any standard that distinguishes different nationalities of English. Joe’s post of January 4th has been the most helpful so far.

I have actually put the apostrophe glyph in the <quoteleft> character, for proprietary display fonts for North American companies that use the fonts in packaging, adverts and posters in the USA, where it is effective in preventing apostrophe boo-boos, and has no down side that I’m aware of.

And I’ve put reversed quote marks in some fonts, coded for English:

language ENG exclude_dflt; # English
sub [quoteleft quotedblleft] by [quoteleft.alt quotedblleft.alt];

Strictly speaking, this is wrong, because it represents one Unicode character by another’s glyph, e.g. Double High-Reversed-9 Quotation Mark (U+201F).

However, this is described as “has same semantic as 201C, but differs in appearance”, which is rather like the relationship between single and double storey /a, and they don’t have separate Unicode points.

Will that be the house red sir, or something special?
