Tool for language support testing

Jure Kožuh
Jure Kožuh Posts: 5
edited January 2013 in Technique and Theory
Hello

Does anybody know a tool that would check a font file and define which languages the font supports?

Comments

  • All of the major font editing tools (FontLab, Glyphs, RoboFont, FontForge) will give you this ability, as will font management software such as FontExplorer and Suitcase Fusion/UTS.

    Pablo Impallari's online font testing tool is incredibly useful in this respect as well:

    http://www.impallari.com/testing/
  • Glyphs app is good at this for fonts it can open (or for fonts developed with it), though I don't know if that's the kind of tool you're looking for.
  • Hey, thanks for the response. I was thinking about a tool that could give you an actual list of supported languages, not just broad groupings like Basic, Western European, etc. Or maybe I need to find lists of languages that fit into Basic, Western European, Central European and South Eastern European (the definitions in Glyphs).

  • The user and all related content has been deleted.
  • This is the default OS X font app you are talking about?
  • The user and all related content has been deleted.
  • Will need to reinstall, thank you!
  • In Font Book you can see such a list when you select "Show Font Info" in the Preview menu.
    Font Book just builds that list based on whatever code pages the font claims to support. I once checked off Arabic and Font Book claimed the font supported it. It also claims most Latin fonts support Greek because font apps check that one off by default.
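    For the curious, those claims live in the code page range bits of the font's OS/2 table, which you can read directly, e.g. with fontTools in Python (a rough sketch; the file name and the handful of bits listed are just examples):

        from fontTools.ttLib import TTFont

        os2 = TTFont("MyFont.ttf")["OS/2"]  # placeholder path
        # A few of the ulCodePageRange1 bits this kind of list is built from:
        claimed = {
            0: "Latin 1 (Western European)",
            1: "Latin 2 (Central European)",
            2: "Cyrillic",
            3: "Greek",
            6: "Arabic",
        }
        for bit, label in claimed.items():
            if os2.ulCodePageRange1 & (1 << bit):
                print("font claims:", label)

    Note this only reflects what the font claims, not what its cmap actually covers, which is exactly the problem described above.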
  • The user and all related content has been deleted.
  • To approach this another way: does anyone know of any resource, online or off, that lists the glyph/diacritic requirements of Latin-based languages, broken out by language? I feel like MS must have this somewhere in their typography site, but I can't find it.
  • Here are some interesting links that might work for you:

    http://www.eki.ee/letter/chardata.cgi?lang=de+German&script=latin

    http://www.evertype.com/alphabets/ (downloadable PDFs)

    https://github.com/typekit/speakeasy (source for Speakeasy)

    I also found mention of John Hudson's work with MS on Sylfaen c. 2000 (later revised), which could be a good reference font.
  • Here's the correct link for the main page of my first reference:

    http://www.eki.ee/letter/
  • Thanks, George. That Eesti Keele link is tremendous.
  • Some more resources:
    omniglot.com/writing/langalph.htm#latin
    en.wikipedia.org/wiki/Alphabets_derived_from_the_Latin

    I’d suggest not relying on only one of the sources mentioned by me or Max Phillips, as I found a few discrepancies between various sources in some of the less common orthographies.

    Depending on the orthography, I’d recommend looking it up in at least two or three of the sources just to make sure you’ve covered it completely.
  • Thanks, James and Stephen!
  • There is a small tool at unifont.org that can actually examine fonts: www.unifont.org/fontaine
  • Given that this question keeps coming up, maybe we could split up the work of doing something about it. A group of people could each pick a chunk of the Latin-script languages listed in Omniglot and plug them into a spreadsheet, which we could post online as a CSV file.
  • "There is a small tool at unifont.org that can actually examine fonts: www.unifont.org/fontaine"
    Dave Crossland and Vitaly Volkov are working on a Python version of it:
    https://github.com/davelab6/pyfontaine
    "I was thinking about a tool that could give you an actual list of supported languages"
    DTL OT Master does that.
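    The core check any of these tools performs is simple enough to sketch in a few lines of Python with fontTools, given per-language codepoint lists from one of the sources above (the two sets below are deliberately tiny, made-up examples, not real orthography data):

        from fontTools.ttLib import TTFont

        # Hypothetical, incomplete requirement sets, for illustration only.
        REQUIRED = {
            "Polish": {0x0104, 0x0105, 0x0106, 0x0107, 0x0118, 0x0119, 0x0141, 0x0142},
            "Turkish": {0x011E, 0x011F, 0x0130, 0x0131, 0x015E, 0x015F},
        }

        def supported_languages(path):
            coverage = set(TTFont(path)["cmap"].getBestCmap())  # codepoints the font maps
            return [lang for lang, cps in REQUIRED.items() if cps <= coverage]

        print(supported_languages("MyFont.ttf"))  # placeholder path

    The hard part, as the rest of the thread suggests, is the quality of the codepoint lists rather than the check itself.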

  • @tphinney AWESOME :) Should be easy to have pyFontaine operate on whatever data you publish :)
  • Thomas Phinney
    Thomas Phinney Posts: 2,883
    edited January 2013
    Yes, I hope so. That was the idea.

    BTW, we did three things with our data structure. One was to put everything in a single file. Having a zillion separate files was just getting too unwieldy.

    We also invoke the notion of a "parent" character set, so we are mostly concerned with what characters are required beyond that base set. Latin-based* languages generally take English as a "parent", as it has a basic character set without any accented letters, and most Latin-based languages drop at most a few of those letters from their alphabet.

    * Yes, I know that many languages using the Latin writing system are not in fact based on Latin. I'm using "Latin-based" as a shorthand for "languages written with the Latin writing system." No offense intended to any language.

    The third and most interesting change was an idea I had when wrestling with the problem of using a single data set for a couple of distinct purposes: the potential for two levels of codepoint coverage for any language. The base level is the codepoints it must have for us to claim it has language support. Then there are additional codepoints which we consider "nice to have," such that if you are doing something like subsetting a font down for that language coverage, you would want to include them. Quite possibly you'd want to include them when making a new font as well.

    As an example, you might not require the new hryvnia currency symbol to be present to say that a font supports Ukrainian, but you'd sure want to include it when subsetting a font down for just Ukrainian, or when building a new Ukrainian-supporting font today.

    As you might guess from the above example, making reasonable data for each language includes thinking about things such as currency symbols, quote marks and other characters that were not always considered in the Fontaine and Speakeasy data sets.

    BTW, here is what the Ukrainian entry currently looks like, for those who are curious about the data structure:

    <language name="Ukrainian" abbreviated-name="UKR" parent="Cyrillic">
      <scanning-codepoints>
        0x02BC,0x0404,0x0406,0x0407,0x0454,
        0x0456,0x0457,0x0490,0x0491
      </scanning-codepoints>
      <subsetting-codepoints>
        0x20B4
      </subsetting-codepoints>
    </language>
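
    Reading entries in that shape is straightforward; here is a rough sketch in Python (the file name "languages.xml" and the assumption that all <language> elements sit under a single root are mine, not necessarily how the released file is organized):

        import xml.etree.ElementTree as ET

        def parse_codepoints(text):
            # Values are usually hex ("0x20B4"); int(tok, 0) also accepts plain decimal.
            return {int(tok, 0) for tok in text.split(",") if tok.strip()}

        def load_languages(path="languages.xml"):  # hypothetical file name
            langs = {}
            for el in ET.parse(path).getroot().iter("language"):
                scanning = parse_codepoints(el.findtext("scanning-codepoints", ""))
                extra = parse_codepoints(el.findtext("subsetting-codepoints", ""))
                langs[el.get("name")] = {
                    "parent": el.get("parent"),
                    "scanning": scanning,            # required to claim language support
                    "subsetting": scanning | extra,  # plus the "nice to have" codepoints
                }
            return langs

    A checker would then test the "scanning" set (plus the parent set's requirements) against a font's cmap, while a subsetter would keep the larger "subsetting" set.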
  • eek, full XML. Would be nicer as JSON, no? :)
  • Reviewing the Fontaine and Speakeasy data formats again (it's been a long time since we started on this project!) I am realizing that we are pretty much our own animal now. Our starting point was actually closer to the Speakeasy format....

    Except they mostly expressed all their Unicode values in decimal, and we went with hexadecimal for compatibility with... well just about every other font-related tool on the planet. Our format actually allows decimal as well, we just mostly avoid it.
  • Thomas Phinney
    Thomas Phinney Posts: 2,883
    I wanted to say that we finally released the data file last week!

    It's linked (and described) in this blog post: http://blog.webink.com/custom-font-subsetting-for-faster-websites/
  • John Hudson
    John Hudson Posts: 3,186
    With regard to a tool for mapping glyph-to-language, it might be interesting for anyone considering this to review the history of the WRIT project (Microsoft & Tiro, 1997–98, presented at ATypI Lyon, originally published in the ATypI journal).

    As Thomas says, the big issue is the quantity and quality of data.
  • Thomas Phinney
    Thomas Phinney Posts: 2,883
    Not to mention that there are some intriguing questions to be asked about where to draw the line, which are likely dependent on the use cases one has for the data.
  • I'm very curious about proper templating for script support, which blends into proper templating for registered feature support per script, which leads to big fonts that need subsetting. And also, as someone who has lots of fonts floating about, I am interested in emerging merging techniques.

    I don't think any of the lists of glyphs by script is worth much unless it is either coming from Apple or MS, or a specific document in use, though the other lists are always cool?

    What Extensis is doing sounds pretty neat! On this one issue Thomas reports: "So Latin-based* languages generally take English as a "parent"..."

    Are not most all subsets regardless of script "taking" what we used to call ASCII, as the base "language" beside that for which the subsetting is intended? And ASCII, as I know it, doesn't actually cover English, does it?

    Also, I'm sort of getting the idea that subsetting depends on the size of use, the script, then matched to one of several levels of composition, if you follow. That way one can divide "what is needed" from "what would be nice" much more definitively?
  • John Hudson
    John Hudson Posts: 3,186
    Thomas, how do you approach subsetting of unencoded glyphs, e.g. smallcaps, ligatures, stylistic variants? Are you parsing GSUB tables to track glyphs that map back to character level subset inclusions, or are you relying on glyph name parsing?
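    To make the GSUB-parsing option concrete for other readers, fontTools' subsetter can compute that closure, starting from the character-level subset and pulling in every glyph reachable through GSUB (a sketch only; the path, codepoints and option choices are placeholders, and this says nothing about how WebINK actually does it):

        from fontTools.ttLib import TTFont
        from fontTools.subset import Options, Subsetter

        font = TTFont("MyFont.otf")                    # placeholder path
        subsetter = Subsetter(options=Options(layout_features=["*"]))  # keep all features
        subsetter.populate(unicodes=[0x0404, 0x0490])  # character-level inclusions
        subsetter.subset(font)                         # GSUB glyph closure happens here
        print(sorted(font.getGlyphOrder()))            # unencoded variants reachable via GSUB survive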
  • John, the "WRIT project" link is not working for me.