Isolated combining marks

What code sequences should we input, and what support should fonts supply?  I think spacing and non-spacing marks may need to be addressed separately.  For example, manual line-breaking may separate a visually following mark from its base letter, so what is responsible for not inserting white space when the typesetter enters <NBSP, ZWJ, mark>?  (One may be trying to reproduce the line-breaking and phrase boundaries of a manuscript written in scriptio continua.)  There may be similar problems with preceding marks if the trailing edges of the lines are to be set flush.

Line breaking can also apply to the holes used to string a document together, e.g. the holes punched in palm-leaf documents.

Comments

  • John Hudson
    Posts: 3,227
    There’s not a lot one can do to mitigate manual linebreaks and their impact on shaping engine behaviour for isolated mark characters (which may or may not be marks at the glyph level, i.e. classified as such in an OpenType GDEF table). Shaping engines may dynamically insert the dotted circle mark carrier (U+25CC) if they consider the isolated mark to be invalid without a base. So, for example, the iikar vowel sign in Devanagari (U+0940) may display with the dotted circle if a manual linebreak leaves it isolated without a base at the beginning of a line: ी.

    There is no standard way to suppress the display of the dotted circle in that situation, because it is being applied at the script shaping level by the shaping engine. Some common marks that are not script-specific may display without the dotted circle because Unicode encodes spacing duplicates whose compatibility decompositions consist of a space character followed by the mark character, e.g.  ́ (U+0020 + U+0301), which is the decomposition of the spacing acute accent U+00B4. But shaping engine cluster validation is not defined by Unicode, nor is it entirely consistent across shaping engine implementations.

    Manual linebreaking can easily subvert normal orthographic rules. In the case of Indic scripts, linebreaking should never happen within an orthographic unit (cluster). If a user inserts a linebreak within a cluster, in effect two clusters are created: one at the end of a line and one at the beginning of the next, each subject to independent cluster validation. In some cases, formatting control characters like ZWJ or ZWNJ can be used to affect the display of some components of the clusters, but this generally applies only to the consonant letters, e.g. one could use ZWJ to force display of a half-form at the end of one line, or of a phalaa (subscript or postscript) form at the beginning of the next line. Isolated marks, however, are almost certain to trigger invalid cluster handling.

    _____

    I don’t understand your comment about punched holes: physical artefacts of the binding of a document are not part of the text, so in what sense can linebreaking be applied to them?
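
The two checks discussed in this comment can be approximated outside a shaping engine. The following is only a minimal sketch in Python: it uses Unicode general categories (Mn, Mc, Me) as a stand-in for "mark", which is an assumption, since a shaping engine applies its own cluster validation and a font's GDEF classes may differ. The function names are hypothetical and exist only for illustration.

```python
import unicodedata

def starts_with_mark(line):
    """True if the first character of line has a mark category (Mn, Mc or Me)."""
    return bool(line) and unicodedata.category(line[0]).startswith("M")

def flag_isolated_marks(text):
    """Return the 1-based numbers of lines that begin with a combining mark."""
    return [n for n, line in enumerate(text.split("\n"), start=1)
            if starts_with_mark(line)]

def compat_space_mark(ch):
    """If ch decomposes to <compat> SPACE + one mark, return that mark, else None."""
    parts = unicodedata.decomposition(ch).split()
    if len(parts) == 3 and parts[0] == "<compat>" and parts[1] == "0020":
        return chr(int(parts[2], 16))
    return None

# KA, a manual linebreak, then VOWEL SIGN II: line 2 starts with an isolated mark.
sample = "\u0915\n\u0940"
print(flag_isolated_marks(sample))
# U+00B4 ACUTE ACCENT is recorded as <compat> U+0020 U+0301.
print(compat_space_mark("\u00B4") == "\u0301")
```

A line flagged this way is only a candidate for dotted circle insertion; whether a given engine actually inserts U+25CC depends on that engine and on the script.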
  • RichardW
    Posts: 100
    From your example of Devanagari iikara, it's even more complicated, because it would seem that there are at least two categories of combining marks - there are those like NEW TAI LUE VOWEL SIGN E and NEW TAI LUE VOWEL SIGN AA where a change of coding can promote them from marks to letters.  I believe it is this sort that are more likely to be separated from their base consonants, and separated they certainly are in the Lanna script.

    You say that linebreaking should not occur within orthographic syllables in Indic scripts.  That may be good modern practice, but to take an example I am familiar with, the Lanna script, the encoding proposal L2/07-007 (https://www.unicode.org/L2/L2007/07007r-n3207r-lanna.pdf) says, "Opportunities for linebreaking are lexical, but a linebreak may not be inserted between a base letter and and a following combining mark.", but all four examples from old manuscripts (Figures 9a, 9b, 10a and 10b) have a left matra separated from its base letter by a linebreak.  On the other hand, only Figure 9b has a line starting with a right matra.

    The standard method of displaying an isolated combining mark is to apply it to NBSP, but these 'general base' characters may have problems at the starts of paragraphs because of impatient script itemisation that does not consider the script of the combining marks.  Rendering systems seem to have been getting better.  I have seen references to Microsoft requiring NBSP + ZWJ, but I can't find them now, and that may be an obsolete requirement.  Internet Explorer 11 on Windows 10, using the Universal Shaping Engine (USE), supports neither.

    The punched holes of palm leaf manuscripts occur a long way from the edges of the text, and can break the flow of text just as the left and right ends of the pages do.  (Palm leaf manuscripts are normally used landscape, though 'panorama' might be a better word.)  Google images will quickly yield examples.  Sometimes columns are used so that the holes are surrounded by text.
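
The NBSP carrier approach mentioned in this comment could be applied mechanically to manually broken text. This is a hedged sketch in Python, not a tested recipe for any renderer: the helper name `add_carrier` and its `use_zwj` flag are hypothetical, "mark" is approximated by Unicode general category, and whether a given rendering system accepts <NBSP, mark> or <NBSP, ZWJ, mark> is exactly the open question raised in the thread.

```python
import unicodedata

NBSP = "\u00A0"  # NO-BREAK SPACE, used here as the general base
ZWJ = "\u200D"   # ZERO WIDTH JOINER, optional and possibly obsolete requirement

def add_carrier(line, use_zwj=False):
    """Prefix line with NBSP (optionally NBSP + ZWJ) if it starts with a mark.

    Lines that are empty or start with a non-mark character are returned
    unchanged.
    """
    if line and unicodedata.category(line[0]).startswith("M"):
        return NBSP + (ZWJ if use_zwj else "") + line
    return line

def add_carriers(text, use_zwj=False):
    """Apply add_carrier to every line of a manually broken text."""
    return "\n".join(add_carrier(line, use_zwj) for line in text.split("\n"))
```

For instance, `add_carriers("\u0915\n\u0940")` leaves the first line alone and turns the second into NBSP followed by the vowel sign. Note that characters recoded from marks to letters, as described above for New Tai Lue, would not be touched by this check.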