Special dash things: softhyphen, horizontalbar

Nina Stössinger
Nina Stössinger Posts: 155
edited March 2017 in Technique and Theory
Hi there, I’m doing some character set research and wondering about the following two characters & whether they ought to be supported (and why, and how) – input welcome:

softhyphen (uni00AD), aka shy or &amp;shy;. The Microsoft Standards have a section on this which reads as though use of this is highly application-specific. Is it wise to include this character (I presume, as a component of the regular hyphen*), as it is (or can be) the thing used/displayed for automatic hyphenation in page layout applications? Any other reasons? Do browsers’ implementations of hyphenation display this glyph, or are they fine with the base hyphen glyph?
Also wondering if this should take a case-specific form if its hyphen/dash friends do.
(* Some Unicode lookup things display this with no outlines, but I wonder if that’s just an implementation question)

horizontalbar (uni2015). This is identical to the emdash in some fonts, and slightly shorter in others. I’m not quite clear on what this does. I’ve seen it referred to as a “quotation dash”, would that suggest anything in particular for its design? Wondering about the usefulness of supporting this at all, and any input on design parameters (including the need/usefulness of a case-sensitive form).

Thanks!

Comments

  • I might like to see horizontalbar as a variant of the emdash: either with the same black body but connecting (if the emdash doesn't, otherwise not connecting); or a longer black body (shorter if the emdash does connect) but with the same set width.

    The softhyphen sounds like it should be identical to the hyphen. Unless it's for Armenian then make it a yentamna.  :->
  • Nick Shinn
    Nick Shinn Posts: 2,216
    Soft Hyphen = Discretionary Hyphen.

    In InDesign: <Type> <Insert Special Character> <Hyphens> <Discretionary Hyphen>

    When you insert it in a word, it looks like nothing happens. But when that word breaks between lines, then the hyphen appears at the pre-determined position.

    I would assume an extra, special character is required, so that the text can be copied to other applications with the same breaking stipulation.
  • Horizontal Bar
    Comments:  quotation dash
    long dash introducing quoted text 

    In my fonts, I make it 150% of the em-dash. In the Unicode charts it is the same size as the em-dash. I don't see much point in including it in fonts if it's identical to the em-dash. Give the user options. 

  • James Puckett
    James Puckett Posts: 1,998
    I think horizontalbar is intended to provide the functionality of 2em and 3em dashes. These were used in metal type to indicate that a word or part of a word was missing or omitted. Some people mention it having no sidebearing so it can be joined up to fill large spaces. I’m not sure if these were the same as a quote dash. 

    Felici shows the 3em dash as being used for indenting bibliography entries, see page 235. Indexes are sometimes indented this way. But most examples on my shelves use the em dash. Reed’s History of the Old English Letter Foundries uses what looks like two em dashes. The only 3em dash I found is in The Indic Scripts, Paleographic and Linguistic Perspectives, but it’s typeset digitally, so I think it’s three em dashes in succession.

    My hypothesis is that 2em, 3em, and quote dashes were all rare in the metal type days, and easily confused, so they’re lumped into one Unicode entry.
  • Kent Lew
    Kent Lew Posts: 944
    Further to what @Khaled Hosny said about U+00AD being a control character, in the Unicode Standard, you will find this description in the chapter on punctuation:
    Soft Hyphen. Despite its name, U+00AD soft hyphen is not a hyphen, but rather an invisible format character used to indicate optional intraword breaks. As described in Section 23.2, Layout Controls, its effect on the appearance of the text depends on the language and script used.
    [Unicode Standard, version 8.0, p.268]
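The distinction quoted above can be checked against the Unicode Character Database. This is a minimal sketch using Python's standard `unicodedata` module; the helper name `describe` is made up for illustration. It lists the general category of the soft hyphen next to a few real dashes: `Cf` is the "format" category, `Pd` is "dash punctuation".

```python
import unicodedata

def describe(ch):
    """Return (code point, Unicode name, general category) for one character."""
    return (f"U+{ord(ch):04X}", unicodedata.name(ch), unicodedata.category(ch))

# Hyphen-minus, soft hyphen, em dash, horizontal bar
for ch in ("\u002D", "\u00AD", "\u2014", "\u2015"):
    print(describe(ch))
```

The soft hyphen is the only one of the four that reports `Cf`, the same category as ZWJ and ZWNJ.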
  • The quotation dash is also the normal way, in Norwegian, of indicating quotes. They usually forgo other typical quote marks, and tend to open with a dash. As for how they do that, well, that’s another matter. A lot of newspapers and magazines tend to use the en dash, and in print I have seen one that could be the em dash but also plausibly the quotation dash. The real problem is that it can’t be typed normally, and none of my Norwegian peers have ever explicitly missed the glyph. Perhaps it’s another piece of Unicode optimism.
  • John Hudson
    John Hudson Posts: 3,227
    Because of the lack of standard advice regarding the length of U+2015, I've tended to literally interpret the name HORIZONTAL BAR, and make it as wide as my /bar glyph is tall. Obviously, this needs the additional observation that my /bar glyph is always full height: extending from the descender to the ascender height (with overshoot), so close to /emdash length.

    Interestingly, U+2015 is included in Windows codepage 1253 Greek. I've not checked to see whether it might be accessible from some deep level of the Greek keyboard layout.
  • Michel Boyer
    Michel Boyer Posts: 120
    edited March 2017
    > Interestingly, U+2015 is included in Windows codepage 1253 Greek. I've not checked to see whether it might be accessible from some deep level of the Greek keyboard layout.

    A "find -exec grep" search tells me that x2015 is output by none of the Standard Keyboards coming with SIL Ukelele. On the other hand, if I search (from the right folder) for the character itself, that I can copy from the Character Viewer and eventually paste in a terminal window, I get the following output

    % find . -iname "*.keylayout" -exec grep -l 'output="―"' {} \;
    ./Unicode.bundle/Contents/Resources/Greek Polytonic.keylayout
    ./Unicode.bundle/Contents/Resources/Greek.keylayout

    You had thus guessed correctly.
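For anyone without `find` and `grep` at hand, the same search can be sketched in Python. This is an illustration only, not what was run above: the function name `find_layouts_with` is hypothetical, and the assumption that a layout might also spell the character as a numeric entity (`&#x2015;`) is mine, so that form is checked as well.

```python
import os

def find_layouts_with(root, char="\u2015"):
    """Return sorted paths of *.keylayout files under `root` that output `char`."""
    # Match the literal character and (assumed) its hexadecimal entity form.
    needles = (f'output="{char}"', f'output="&#x{ord(char):04X};"')
    hits = []
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            if not name.lower().endswith(".keylayout"):
                continue
            path = os.path.join(dirpath, name)
            with open(path, encoding="utf-8", errors="replace") as fh:
                text = fh.read()
            if any(needle in text for needle in needles):
                hits.append(path)
    return sorted(hits)

if __name__ == "__main__":
    # Run from the folder that holds the keyboard layout bundles.
    for path in find_layouts_with("."):
        print(path)
```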


  • Kent Lew
    Kent Lew Posts: 944
    > I've not checked to see whether it might be accessible from some deep level of the Greek keyboard layout.

    Shift-Option-Q (―) on Mac Greek keyboard.

    But I don’t actually see it on the Mac Greek Polytonic keyboard (unless it’s the function of some dead key that I’m not seeing.)
  • Thanks all for the input!
    Kent Lew said:
    > Further to what @Khaled Hosny said about U+00AD being a control character, in the Unicode Standard, you will find this description in the chapter on punctuation:
    > Soft Hyphen. Despite its name, U+00AD soft hyphen is not a hyphen, but rather an invisible format character used to indicate optional intraword breaks. As described in Section 23.2, Layout Controls, its effect on the appearance of the text depends on the language and script used.
    > [Unicode Standard, version 8.0, p.268]
    Oh wait, but would that imply that it should better not be double encoded with hyphen but rather be present as a non printing character? (Sorry, confused)

    Re. quotation dash, I have typeset a grand total of one book that used quotation dashes (a translation from Norwegian whose editor opted to preserve this style), and have found that endashes seemed a little short and emdashes maybe a little long, so I could see how that would imply having a length somewhere in between. Wonder how much it would actually get used though.
    Devil’s Advocate question: Is there any serious downside to just not having this character? It seems relatively uh, nonessential?
  • John Hudson
    John Hudson Posts: 3,227
    edited March 2017
    > Oh wait, but would that imply that it should better not be double encoded with hyphen but rather be present as a non printing character? (Sorry, confused)

    You're not alone. I suspect it really doesn't matter at all what a font does with U+00AD: whether it includes it as a duplicate of the hyphen glyph, double-encodes the hyphen glyph, or doesn't support it in the glyph set at all. Any software that implements soft hyphen support is not going to be relying on the presence of a glyph, and won't be displaying a glyph if it is present, but will instead be using the presence of the control character in text to identify a preferred hyphenation point. If the text is hyphenated, then the appropriate hyphen glyph will be displayed (which will depend on the script and language involved).

    That said, Khaled, do you have any experience with working with soft hyphen in text editing software in show-control-character mode? I wonder if this is a situation in which having a visible glyph — as we do for ZWJ and ZWNJ — might be useful?
  • Kent Lew
    Kent Lew Posts: 944
    Nina — As I said, I only double-encode U+00AD in case there is a validation tool that needs the codepoint in order to permit indication of certain codepage coverages. I suspect, as John does, that it doesn’t actually matter what’s in the font or not.
  • My interpretation of the "representation" of U+00AD in the Unicode font chart http://www.unicode.org/charts/PDF/U0080.pdf is that the character has no associated glyph, like all the other characters whose representation is a dashed square, the kind of representation that is used in a fallback font.
  • John Hudson
    John Hudson Posts: 3,227
    > My interpretation of the "representation" of U+00AD in the Unicode font chart http://www.unicode.org/charts/PDF/U0080.pdf is that the character has no associated glyph, like all the other characters whose representation is a dashed square, the kind of representation that is used in a fallback font.

    Take caution applying this interpretation to other blocks: a dotted square is also used in Unicode code charts to indicate an enclosing character, such as the first six codepoints in the Arabic block.
  • Michel Boyer
    Michel Boyer Posts: 120
    edited March 2017
    >  Take caution applying this interpretation to other blocks

    ... and if I add that the “name” of the character is written in capital letters inside the dashed rectangle? 
  • Nick Shinn
    Nick Shinn Posts: 2,216
    I think it could be useful to include the character.

    This is a somewhat hypothetical premise, but I do recall in the past, working as a graphic designer, situations where a client or product name was broken in a strange place—they don’t like that. And I see words broken in ways that disrupt expectation daily in my newspaper.

    So entering the discretionary hyphen in a document, once, will ensure that the break will be transported with the text when it is copied to other documents.

    For some reason, I am reminded of a small Yorkshire town, Penistone (pronounced pen-iss-tun).


  • John Hudson
    John Hudson Posts: 3,227
    > So entering the discretionary hyphen in a document, once, will ensure that the break will be transported with the text when it is copied to other documents.

    Right, but the point of control characters is that they don't need to be displayed under normal circumstances, and hence don't require glyphs to be present in the font. The soft-hyphen in a document is just a code in the text string.

  • Michel Boyer
    Michel Boyer Posts: 120
    edited March 2017
    Nick, in TeX, soft hyphens are written as \- in the source file and I have been using them for years, if only to force breaking a word that the hyphenation dictionary does not know. I guess that your Yorkshire town's break pattern is Pen\-istone. Fine. However, TeX fonts do not contain a soft hyphen.

    The question is what is needed in the font to produce the final text. Clearly TeX or InDesign will eventually use those soft hyphens to decide where the word is to be broken at the end of a line. Does InDesign or LaTeX need a special character to put as hyphen before the line break? No, the hyphen needs to look exactly like the other hyphens that were not triggered by a soft hyphen, and what is used in TeX is the hyphen character. The application needs to know it is there in the source file but only needs a hyphen to do its job.

    Can you give me the name of an application that will handle correctly soft hyphens in the source file if the soft hyphen is encoded in the font and that will break if it is not encoded in the font? 
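The behaviour described in this post can be illustrated with a toy line breaker. It is a deliberately simplified sketch, not how TeX or InDesign actually work: widths are counted in characters, `break_word` is a hypothetical helper, and the visible hyphen is hard-coded to U+002D. What it shows is that the soft hyphen only marks where a break is allowed, and that the hyphen on the page comes from the ordinary hyphen character.

```python
SHY = "\u00AD"  # soft hyphen: marks an allowed break, never displayed itself

def break_word(word, room, hyphen="-"):
    """Split `word` at the last soft hyphen whose head fits in `room` columns.

    Returns (head, tail). All soft hyphens are dropped from the output;
    a visible hyphen is added only at the break that is actually taken.
    """
    parts = word.split(SHY)
    best = None
    for i in range(1, len(parts)):
        head = "".join(parts[:i]) + hyphen
        if len(head) <= room:
            best = i  # keep the latest break point that still fits
    if best is None:
        return "", "".join(parts)  # no permitted break fits: keep the word whole
    return "".join(parts[:best]) + hyphen, "".join(parts[best:])

head, tail = break_word("Pen\u00ADistone", room=5)
print(head)  # end of the current line
print(tail)  # start of the next line
```

With `room=5` the word splits into "Pen-" and "istone"; with no soft hyphen in the string, or too little room, it is returned unbroken and no hyphen appears at all.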

  • Ray Larabie
    Ray Larabie Posts: 1,436
     Penistone would be a terrible Crayola color name.
  • > Oh wait, but would that imply that it should better not be double encoded with hyphen but rather be present as a non printing character? (Sorry, confused)

    For a well behaving implementation, it does not matter what you put there as it will be ignored anyway. The double encoding is needed for broken implementations that will try to use the font glyph (e.g. some versions of Google Chrome used the glyph for U+00AD when breaking the line at it, but I think this is fixed now).

    > That said, Khaled, do you have any experience with working with soft hyphen in text editing software in show-control-character mode? I wonder if this is a situation in which having a visible glyph — as we do for ZWJ and ZWNJ — might be useful?

    There are not that many applications that have this feature, but the two I’m familiar with, LibreOffice and Scribus, do not use the U+00AD glyph. Scribus actually just draws some predefined shapes for the few “invisible” characters it supports.

    HarfBuzz, however, has an option to preserve Default_Ignorable characters which will just keep whatever glyph the font has for them, but I don’t know any applications that use it.
  • Michel Boyer
    Michel Boyer Posts: 120
    edited March 2017
    If I run the command (from the Mac font tools)
    ftxinstalledfonts -f -U00AD

    I get that (among others) Apple Chancery, Chalkboard, Chalkduster, Copperplate, Didot Italic, Hoefler Text and Skia do not have the character U+00AD. What can be the consequences?

    I also checked that all the ttf and otf files for which ftxinstalledfonts -f -U00AD gives a YES answer have a contour or a reference for it (by checking U+00AD with the fontforge function isWorthOutputting)
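The three situations discussed in this thread (not encoded, double-encoded with the hyphen, or a glyph of its own) can be told apart from a font's character map. The sketch below is an illustration under assumptions: `soft_hyphen_status` is a hypothetical helper, and it expects a plain dict from code point to glyph name, which is the shape that fontTools' `TTFont(path).getBestCmap()` returns. The sample dicts are invented, not taken from the fonts named above.

```python
SOFT_HYPHEN = 0x00AD
HYPHEN_MINUS = 0x002D

def soft_hyphen_status(cmap):
    """Classify how a code point -> glyph name mapping treats U+00AD."""
    glyph = cmap.get(SOFT_HYPHEN)
    if glyph is None:
        return "not encoded"
    if glyph == cmap.get(HYPHEN_MINUS):
        return "double-encoded with hyphen"
    return "separate glyph"

# Made-up character maps for illustration only.
samples = {
    "no U+00AD": {0x002D: "hyphen"},
    "double-encoded": {0x002D: "hyphen", 0x00AD: "hyphen"},
    "own glyph": {0x002D: "hyphen", 0x00AD: "uni00AD"},
}
for label, cmap in samples.items():
    print(label, "->", soft_hyphen_status(cmap))
```

Note that "separate glyph" says nothing about whether that glyph has outlines; checking for a contour or reference, as done above with FontForge, is a separate step.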

  • > Penistone would be a terrible Crayola color name.

    We once visited Peníscola, Spain. When I called Europcar to reserve a car there, the lady thought I was crank-calling.

  • Kent Lew
    Kent Lew Posts: 944
    > Scribus actually just draws some predefined shapes for the few “invisible” characters it supports.

    InDesign also has its own predefined symbols for displaying invisible characters when Show Hidden Characters is activated, independent of what might be encoded in the font.

    FWIW, TextWrangler does appear to actually display the encoded outline for discretionary hyphen U+00AD. So, in Nick’s example above, pasting such a text into TextWrangler makes the discretionary hyphen point visible. As far as I can tell, if one is not encoded, it will use a fallback font. (Not a good reason to draw one, I’m just reporting . . . )

    But this is a text-processing and coding app, not a layout app. Most people don’t go around applying different fonts in TextWrangler. ;-)

  • Nina Stössinger
    Nina Stössinger Posts: 155
    edited April 2017
    Regarding the horizontal bar, I wanted to add that I have since learned that the U+2015 has a different linebreaking behavior in “at least some software” (as per Wikipedia Quotation Dash), in that automatic line breaks directly following this character are suppressed, unlike for the standard dashes. (Thanks to Frode for pointing this out.) I did some quick testing and it seems that Word (for Mac 2011) and TextEdit do appear to honor this distinction whereas current versions of InDesign and Illustrator do not. Hm.

    Abovementioned Wikipedia link also names a whole row of languages that use quotation dashes: Bulgarian, French, Greek, Hungarian, Polish, Portuguese, Romanian, Russian, Spanish, Swedish, Turkish, and Vietnamese. Still curious to hear in which of these (if any) use of the dedicated codepoint is frequent vs. people just using en/emdashes as seems to be the case for Norwegian.
  • Chris Lozos
    Chris Lozos Posts: 1,458
    > We once visited Peníscola, Spain.

    That is why in Florida, they changed the spelling, swapping the i with s ;-)
  • > [...] Abovementioned Wikipedia link also names a whole row of languages that use quotation dashes: Bulgarian, French, [...]
    Amusingly, my 1975 edition of the “Lexique des règles typographiques en usage à l’imprimerie nationale” states that “Les changements d’interlocuteurs seront marqués par des moins (tirets)” (“Changes of speaker shall be marked by minus signs (dashes)”), which means the French national printing office was using the minus sign for that kind of dash.
  • > Still curious to hear in which of these (if any) use of the dedicated codepoint is frequent vs. people just using en/emdashes as seems to be the case for Norwegian.
    That would be the case in Spanish and Portuguese as well, and I guess it’s due to the lack of support for U+2015 in most publishing fonts. In these languages, as a matter of style you can use both U+2014 and U+2013 as quotation dashes, but to avoid the wrong line-breaking behavior of these characters, you may need to add U+2060 next to them.
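The U+2060 workaround mentioned here could be automated along these lines. This is only a sketch under assumptions: it treats an en or em dash at the very start of a line as a quotation dash, and it places WORD JOINER directly after it so that no break can occur between the dash and the next character. The function name is hypothetical, and real dialogue text (for example a dash followed by a space) may call for a different rule.

```python
import re

WORD_JOINER = "\u2060"

# An en dash (U+2013) or em dash (U+2014) at the start of a line,
# not already followed by a word joiner.
_LEADING_DASH = re.compile(r"^([\u2013\u2014])(?!\u2060)", re.MULTILINE)

def protect_quotation_dashes(text):
    """Insert U+2060 after each line-initial en/em dash used as a quotation dash."""
    return _LEADING_DASH.sub(lambda m: m.group(1) + WORD_JOINER, text)

dialogue = "\u2014Hola\n\u2013Adi\u00f3s"
protected = protect_quotation_dashes(dialogue)
print([f"U+{ord(c):04X}" for c in protected[:2]])
```

Dashes in the middle of a line are left alone, and running the function twice does not add a second joiner.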