Tool for language support testing


Comments

  • John Hudson
    John Hudson Posts: 3,536
    Strange, it works for me. Here is the URL:
    http://www.tiro.com/Articles/sylfaen_article.pdf

    The article was originally written for the short-lived ATypI Type journal. In the event, they couldn't afford to print that issue, so it was published online for members only, and then the journal folded. We republished it on our own site later.
  • That works thanks!

    What I'm finding interesting is that thickly featured, hinted and kerned fonts can be subset pretty easily, with, as Thomas points out, the right kind of database, into the full set of scripts that OSes like. But then what happens is that there is an occupational spectrum in the use of each of those scripts, which directs the content of subsets much more precisely. I think it is this subsetting that puts the user in the right position to communicate their ideas, and not the preferences of the type designer or the decisions of the OS.

  • John Hudson
    Occupational usage, yes, but also national, regional, house style and, in the end, individual. The ways in which script use gets filtered first to language use -- with attendant questions regarding naturalised loan words, transcription*, etc. -- and then to particular kinds of documents and then to individual documents, seem to me an inverted funnel: you start out with something complex but unitary -- the script and the system by which it functions generically independent of particular languages and usage --, and then you start multiplying all the variations and exceptions to that system, until by the time you get to the ever-expanding base of the funnel, which is the plane of all currently existing documents, the most you can hope for in terms of subsetting is being able to identify significant clusters in the overall scattershot.


    * This week's edition of 'Word of Mouth' on Radio 4 was about contemporary language use in India, including discussion of the widespread practice of transcribing English words in Indic scripts. This is something I've observed in my efforts to compile conjunct sequence frequency data for Indian languages: almost all electronic corpora are massively polluted with transcribed English words (not necessarily a problem if one is concerned with supporting that kind of language use, but a pain in the neck if trying to define appropriate glyph sets for pre-modern literature).
  • "Occupational usage, yes, but also national, regional, house style and, in the end, individual. "

    I agree with this, all the way to the little bitty whirls of personal style being important. But in terms of distribution now, the number of glyph repertoire "customizations" is not an issue for either the content developer or the reader.

    "...until by the time you get to the ever-expanding base of the funnel, which is the plane of all currently existing documents,..."

    But for a document... the more usual "funnel shape", I think, is inverted, such that most content has access to 10-100x the number of glyphs it needs. The point I'm making is for users to have access to better typography via selection of the occupation of the composer of their type, rather than either their own occupations, or feature by feature as with OT, which, after 20 years, has gotten old. :)
  • John Hudson
    > The point I'm making is for users to have access to better typography via selection of the occupation of the composer of their type

    I'm not sure I understand. Can you give an example?

    It seems to me that with regard to subsetting or filtering of character/glyph support -- and hence of what text content and kinds of typography are possible -- the big challenge is dynamic content. With websites pulling content from sources all over the Net, inviting reader comments, etc., it's pretty much impossible to anticipate text content, and hence to control typography in any way that we would traditionally recognise as 'composition'. ไม่คุณคิดว่า?
  • John, "it's pretty much impossible to anticipate text content, "

    Anticipate all content, sure. But the job of so many "writers", human or otherwise, is to make contentish things they anticipate their readers will want to devour, or at least recognize. What should we do about all that?

    "...and hence control typography in any way that we would traditional recognise as 'composition'. ไม่คุณคิดว่า?"

    In any way? You have intentionally written something for an audience that doesn't anticipate Extra-Latin. Knowing the reader(ship) well enough not to violate the prime directives is, rest assured, covered. So I am in no way suggesting an odyssey in search of the perfect automated typographic connection between a Swedish housepainter and a Hindi physician, or at least not in the next decade or so.
  • Thomas Phinney
    Thomas Phinney Posts: 3,090
    David asked:

    > Are not most all subsets regardless of script "taking" what we used to call ASCII, as the base "language" beside that for which the subsetting is intended?

    If you are speaking of some hypothetical, maybe so. In terms of our languages.xml file, I don't think I would say "most" are.

    > And ASCII, as I know it, doesn't actually cover English, does it?

    Well, there are certainly plenty of things in our "English" definition that are not within ASCII, including curly quotes and dashes and whatnot. So I would agree with you there.

    John asked:

    "Thomas, how do you approach subsetting of unencoded glyphs, e.g. smallcaps, ligatures, stylistic variants? Are you parsing GSUB tables to track glyphs that map back to character level subset inclusions, or are you relying on glyph name parsing?"

    We used to rely on glyph name parsing, but switched to parsing GSUB tables a while back. Even among fairly high-quality foundries, hoping they would rigorously follow Adobe glyph-naming guidelines turned out to be in vain.
  • John Hudson
    edited April 2013
    I think parsing GSUB is a better idea anyway, although for subsetting purposes name conventions would be easier to make work universally than they are for Acrobat text reconstruction. I've basically given up on trying to implement Adobe glyph naming rules for Indic fonts: when you have a single glyph that represents non-sequential characters you can easily end up with ambiguous text reconstruction results.
  • I find Latin Plus very useful. It's a pity it doesn't have a Cyrillic section.
  • As Pablo mentioned in January 2013, the language section of DTL OTMaster’s consistency checker provides an actual list of supported languages. This includes Cyrillic and much more.


  • To approach this another way: does anyone know of any resource, online or off, that lists the glyph/diacritic requirements of Latin-based languages, broken out by language? I feel like MS must have this somewhere in their typography site, but I can't find it.
    http://www.urtd.net/x/cod/
  • Unfortunately I am not as rich as Dave and hence I cannot offer OTM for free or convert it to a free online service. However, for those who cannot wait for Dave’s online tool to work and also could use some of the other functionality OTM offers, I will bravely and stubbornly remain committed to offering DTL’s flagship tool.
  • So, in all this time, nobody has mentioned the Unicode CLDR data, which includes exemplar character data for many languages?
    http://cldr.unicode.org/translation/characters#TOC-Exemplar-Characters

  • Peter, I think that data is what is powering these tools. Frank, is this what your tool uses? 
  • Whatever became of PyFontaine, @Dave Crossland?

  • John Hudson
    edited October 2017
    CLDR is a foundation for a lot of work in this area. But CLDR isn't always reliable in itself, because just knowing what characters are used in a language doesn't tell you, for instance, whether the form of those characters used is appropriate, or whether the font provides adequate shaping information for those characters for a particular language (e.g. appropriate conjunct forms in an Indic font).

    The problem is similar to the one that the 'Design languages' and 'Supported languages' entries in the new OT 'meta' table seek to address. CLDR is a good first step in identifying a minimal level of language support in terms of character set coverage, but it isn't sufficient to accurately identify the level of language support beyond that, let alone the design language(s) of a font.
  • The character set page on the Underware site is also very useful. You can find some interesting data on every character, validate your font, and easily see all the different shapes of a character in Underware's own fonts: http://www.underware.nl/latin_plus/character_set/
  • The Charset Checker by Alphabet Type is a web service that lets you check your font against a customizable set of languages. You can also use their Charset Builder to generate encodings with the desired language support. Their data is also based on the Unicode CLDR.
  • With my Unicode officer hat on: It's good to hear that font tools are finding the CLDR data useful.

    Btw, I agree with John: it's a useful first step, but not the whole story.
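
The character-coverage check that several posts in this thread describe can be sketched roughly as follows. The exemplar sets below are hand-picked for illustration and are NOT copied from CLDR, and the names `EXEMPLARS`, `missing_characters` and `supported_languages` are hypothetical. In practice the font's character set would come from its cmap table and the exemplars from the CLDR data.

```python
from typing import Dict, List, Set

# Illustrative exemplar sets only: hand-picked, not taken from CLDR.
EXEMPLARS: Dict[str, Set[str]] = {
    # Lowercase letters plus curly quotes and dashes (written as escapes).
    "en": set("abcdefghijklmnopqrstuvwxyz") | set("\u2018\u2019\u201c\u201d\u2013\u2014"),
    "sv": set("abcdefghijklmnopqrstuvwxyzåäö"),
    "ru": set("абвгдеёжзийклмнопрстуфхцчшщъыьэюя"),
}


def missing_characters(font_chars: Set[str], lang: str) -> Set[str]:
    """Characters the language's exemplar set needs but the font lacks."""
    return EXEMPLARS[lang] - font_chars


def supported_languages(font_chars: Set[str]) -> List[str]:
    """Languages whose whole exemplar set is present in the font."""
    return sorted(
        lang for lang in EXEMPLARS if not missing_characters(font_chars, lang)
    )


# Stand-ins for the set of characters a font maps.
ascii_font = {chr(cp) for cp in range(0x20, 0x7F)}
latin_font = ascii_font | set("åäö\u2018\u2019\u201c\u201d\u2013\u2014")

print(supported_languages(ascii_font))   # []
print(supported_languages(latin_font))   # ['en', 'sv']
```

With these toy sets, an ASCII-only font fails even the "English" check because of the quotes and dashes, which matches the point made earlier in the thread. The check says nothing about glyph shapes or shaping behaviour, so it establishes only the minimal level of support.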