Tool for language support testing

Jure Kožuh · January 2013

Hello

Does anybody know a tool that would check a font file and define which languages the font supports?

Karl Stange · January 2013

All of the major font editing tools (FontLab, Glyphs, RoboFont, FontForge) will give you this ability, as will font management software such as FontExplorer and Suitcase Fusion/UTS.

Pablo Impallari's online font testing tool is incredibly useful in this respect as well:

http://www.impallari.com/testing/

Craig Eliason · January 2013

Glyphs app is good at this for fonts it can open (or for fonts developed with it), though I don't know if that's the kind of tool you're looking for.

Jure Kožuh · January 2013

Hey, thanks for the responses. I was thinking about a tool that could give you an actual list of supported languages, not just the definition of basic, western EU, ... Or maybe I need to find lists of languages that fit into Basic, Western European, Central European and South Eastern European (the definitions in Glyphs).

[Deleted User] · January 2013

The user and all related content has been deleted.

Jure Kožuh · January 2013

This is the default OSX font app you are talking about?

[Deleted User] · January 2013

The user and all related content has been deleted.

Jure Kožuh · January 2013

Will need to reinstall, thank you!

James Puckett · January 2013

In FontBook you can see such a list when you select "Show Font Info" in the Preview menu.

FontBook just builds that list based on whatever code pages the font claims to support. I once checked off Arabic and FontBook claimed the font supported it. It also claims most Latin fonts support Greek because font apps check that one off by default.

[Deleted User] · January 2013

The user and all related content has been deleted.

Max Phillips · January 2013

To approach this another way: does anyone know of any resource, online or off, that lists the glyph/diacritic requirements of Latin-based languages, broken out by language? I feel like MS must have this somewhere in their typography site, but I can't find it.

George Thomas · January 2013

Here are some interesting links that might work for you:

http://www.eki.ee/letter/chardata.cgi?lang=de+German&script=latin

http://www.evertype.com/alphabets/ (downloadable PDFs)

https://github.com/typekit/speakeasy (source for Speakeasy)

I also found mention of John Hudson's work with MS on Sylfaen c. 2000, later revised into that could be a good reference font.

George Thomas · January 2013

Here's the correct link for the main page of my first reference:

http://www.eki.ee/letter/

Max Phillips · January 2013

Thanks, George. That Eesti Keele link is tremendous.

James Hultquist-Todd · January 2013

Some more resources:
omniglot.com/writing/langalph.htm#latin
en.wikipedia.org/wiki/Alphabets_derived_from_the_Latin

I’d suggest not relying on only one of the sources mentioned either by myself or Max Phillips, as I found discrepancies between sources in a few of the less common orthographies.

Depending on the orthography, I’d recommend looking it up in at least two or three of the sources just to make sure you’ve covered it completely.

Stephen Coles · January 2013

http://diacritics.typo.cz/index.php?id=49

Max Phillips · January 2013

Thanks, James and Stephen!

Georg Seifert · January 2013

There is a small tool on unifont.org that can actually examine fonts: www.unifont.org/fontaine

James Puckett · January 2013

Given that this question keeps coming up, maybe we could split up the work of doing something about it. A group of people could each pick a chunk of the Latin languages listed in Omniglot and plug them into a spreadsheet, which we could post online as a CSV file.

PabloImpallari · January 2013

"There is a small tool on unifont.org that can actually examine fonts: www.unifont.org/fontaine"

Dave Crossland and Vitaly Volkov are working on a Python version of it:
https://github.com/davelab6/pyfontaine

"I was thinking about a tool that could give you an actual list of supported languages"

DTL OT Master does that
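
For readers who want to see the shape of such a check, here is a minimal Python sketch. It is not how Fontaine, pyfontaine or DTL OT Master work internally. It assumes the third-party fontTools library for reading the font's character map, and the two-language table is an illustrative, incomplete stand-in for real language data. The function names are hypothetical.

```python
import string

# Illustrative, incomplete requirements (an assumption for this sketch):
# each language maps to the characters its alphabet needs.
REQUIRED = {
    "German": string.ascii_letters + "äöüßÄÖÜ",
    "Swedish": string.ascii_letters + "åäöÅÄÖ",
}


def font_codepoints(path):
    """Return the set of Unicode codepoints mapped in a font file's cmap."""
    from fontTools.ttLib import TTFont  # third-party, imported lazily

    font = TTFont(path)
    # getBestCmap() returns a {codepoint: glyph name} dict, or None if the
    # font has no Unicode cmap subtable.
    return set(font.getBestCmap() or {})


def supported_languages(codepoints, required=REQUIRED):
    """List the languages whose required characters are all present."""
    return sorted(
        name
        for name, chars in required.items()
        if all(ord(ch) in codepoints for ch in chars)
    )
```

With a real font this would be called as `supported_languages(font_codepoints("MyFont.otf"))`, where the file name is a placeholder. The result is only as good as the language table, which is the point raised in the next post.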

Thomas Phinney · January 2013

Besides code to analyze the font, the other problem is the quantity and quality of the language data feeding into it.

We've been working on this problem at Extensis as part of WebINK's new dynamic subsetting infrastructure. I've been building the language data files myself, after we first incorporated every definition from Fontaine and Speakeasy, then WinANSI, MacRoman, Adobe's standard character set definitions, and then adding still more. The Latin and Cyrillic coverage is increasingly good.

We will be releasing our data file as open source.

The data structure is a slightly modified version of Fontaine's. It would be trivial to modify it for use in Fontaine. It could be converted for use in Speakeasy with a bit more work.

I've mentioned this in passing, at least privately to one or two folks, but as it is getting pretty large and increasingly extensive, and has had many corrections, it seems to be nearly time to release it.

Dave Crossland · January 2013

@tphinney AWESOME

Should be easy to have pyFontaine operate on whatever data you publish

Thomas Phinney · January 2013

Yes, I hope so. That was the idea.

BTW, we did three things with our data structure. One was to put everything in a single file. Having a zillion separate files was just getting too unwieldy.

We also invoke the notion of a "parent" character set. So we are mostly concerned with what characters are required beyond that base set. So Latin-based* languages generally take English as a "parent" as it has a basic character set without any accented letters, and most Latin-based languages drop at most a very few of those letters for their alphabet.

* Yes, I know that many languages using the Latin writing system are not in fact based on Latin. I'm using "Latin-based" as a shorthand for "languages written with the Latin writing system." No offense intended to any language.

The third and most interesting change was an idea I had when wrestling with the problem of using a single data set for a couple of distinct purposes: the potential for two levels of codepoint coverage for any language. The base level is the codepoints it must have for us to claim it has language support. Then there are additional codepoints which we consider "nice to have," such that if you are doing something like subsetting a font down for that language coverage, you would want to include them. Quite possibly you'd want to include them when making a new font as well.

As an example, you might not require the new hryvnia currency symbol to be present to say that a font supports Ukrainian, but you'd sure want to include it when subsetting a font down for just Ukrainian, or when building a new Ukrainian-supporting font today.

As you might guess from the above example, making reasonable data for each language includes thinking about things such as currency symbols, quote marks and other characters that were not always considered in the Fontaine and Speakeasy data sets.
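
The "parent" idea and the two levels of coverage can be sketched in Python as follows. This is an illustration under stated assumptions, not the Extensis implementation: the class and function names are hypothetical, the Ukrainian codepoints are the ones quoted in this thread, and the "Cyrillic" parent is a stand-in (the basic Russian alphabet block) because the real parent definition is not shown here.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Language:
    name: str
    parent: Optional[str] = None
    scanning: frozenset = frozenset()    # must be present to claim support
    subsetting: frozenset = frozenset()  # "nice to have" extras


LANGUAGES = {
    # Stand-in parent (assumption): U+0410..U+044F, the basic Russian letters.
    "Cyrillic": Language("Cyrillic", scanning=frozenset(range(0x0410, 0x0450))),
    "Ukrainian": Language(
        "Ukrainian",
        parent="Cyrillic",
        scanning=frozenset({0x02BC, 0x0404, 0x0406, 0x0407, 0x0454,
                            0x0456, 0x0457, 0x0490, 0x0491}),
        subsetting=frozenset({0x20B4}),  # hryvnia sign
    ),
}


def _chain(name):
    """Yield the language and then each of its ancestors."""
    while name is not None:
        lang = LANGUAGES[name]
        yield lang
        name = lang.parent


def required_codepoints(name):
    """Codepoints a font must have before we say it supports the language."""
    return frozenset().union(*(lang.scanning for lang in _chain(name)))


def subset_codepoints(name):
    """Required codepoints plus the nice-to-have extras, for subsetting."""
    extras = (lang.subsetting for lang in _chain(name))
    return required_codepoints(name).union(*extras)


def supports(font_codepoints, name):
    """True if every required codepoint is present in the font."""
    return required_codepoints(name) <= set(font_codepoints)
```

Keeping the two sets separate means the same entry can answer both questions: `supports()` ignores the hryvnia sign, while `subset_codepoints()` includes it.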

BTW, here is what the Ukrainian entry currently looks like, for those who are curious about the data structure:


<language name="Ukrainian" abbreviated-name="UKR" parent="Cyrillic">
  <scanning-codepoints>
    0x02BC,0x0404,0x0406,0x0407,0x0454,
    0x0456,0x0457,0x0490,0x0491
  </scanning-codepoints>
  <subsetting-codepoints>
    0x20B4
  </subsetting-codepoints>
</language>
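
An entry like the one above can be read with Python's standard library alone. The following is a rough sketch, assuming only the element and attribute names visible in this entry. The helper names are hypothetical, and the released file may wrap entries in additional structure not shown here.

```python
import xml.etree.ElementTree as ET

# The Ukrainian entry exactly as quoted in the thread.
ENTRY = """<language name="Ukrainian" abbreviated-name="UKR" parent="Cyrillic">
<scanning-codepoints>
0x02BC,0x0404,0x0406,0x0407,0x0454,
0x0456,0x0457,0x0490,0x0491
</scanning-codepoints>
<subsetting-codepoints>
0x20B4
</subsetting-codepoints>
</language>"""


def parse_codepoints(text):
    """Turn a comma-separated list of hex values into a set of ints."""
    tokens = (tok.strip() for tok in (text or "").split(","))
    return {int(tok, 16) for tok in tokens if tok}


def parse_language(xml_text):
    """Parse one <language> element into a plain dict."""
    elem = ET.fromstring(xml_text)
    return {
        "name": elem.get("name"),
        "abbreviated-name": elem.get("abbreviated-name"),
        "parent": elem.get("parent"),
        "scanning": parse_codepoints(elem.findtext("scanning-codepoints")),
        "subsetting": parse_codepoints(elem.findtext("subsetting-codepoints")),
    }
```

For example, `parse_language(ENTRY)["subsetting"]` gives a set containing only U+20B4, the hryvnia sign.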

Dave Crossland · January 2013

eek, full XML. Would be nicer as JSON, no?

Thomas Phinney · January 2013

Reviewing the Fontaine and Speakeasy data formats again (it's been a long time since we started on this project!) I am realizing that we are pretty much our own animal now. Our starting point was actually closer to the Speakeasy format....

Except they mostly expressed all their Unicode values in decimal, and we went with hexadecimal for compatibility with... well just about every other font-related tool on the planet. Our format actually allows decimal as well, we just mostly avoid it.
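
A reader that accepts both notations is straightforward to write. As a small Python sketch (the function name is hypothetical, and it assumes hexadecimal values always carry the `0x` prefix, as in the Ukrainian entry above):

```python
def parse_codepoint(token: str) -> int:
    """Accept a codepoint written as hexadecimal (0x0404) or decimal (1028)."""
    token = token.strip()
    if token.lower().startswith("0x"):
        return int(token, 16)  # int() accepts the 0x prefix when base is 16
    return int(token, 10)
```

Both `parse_codepoint("0x0404")` and `parse_codepoint("1028")` refer to the same character, U+0404.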

Thomas Phinney · April 2013

I wanted to say that we finally released the data file last week!

It's linked (and described) in this blog post: http://blog.webink.com/custom-font-subsetting-for-faster-websites/

John Hudson · April 2013

With regard to a tool for mapping glyph-to-language, it might be interesting for anyone considering this to review the history of the WRIT project (Microsoft & Tiro, 1997–98, presented at ATypI Lyon, originally published in the ATypI journal).

As Thomas says, the big issue is the quantity and quality of data.

Thomas Phinney · April 2013

Not to mention that there are some intriguing questions to be asked about where to draw the line, which are likely dependent on the use cases one has for the data.

Deleted Account · April 2013

I'm very curious about proper templating for script support, which blends into proper templating for registered feature support per script, which leads to big fonts that need subsetting. And also, as someone who has lots of fonts floating about, I am interested in emerging merging techniques.

I don't think any of the lists of glyphs by script is worth much unless it is either coming from Apple or MS, or a specific document in use, though the other lists are always cool?

What Extensis is doing sounds pretty neat! On this one issue Thomas reports: "So Latin-based* languages generally take English as a "parent"..."

Aren't almost all subsets, regardless of script, "taking" what we used to call ASCII as the base "language" besides the one for which the subsetting is intended? And ASCII, as I know it, doesn't actually cover English, does it?

Also, I'm sort of getting the idea that subsetting depends on the size of use, the script, then matched to one of several levels of composition, if you follow. That way one can divide "what is needed" from "what would be nice" much more definitively?

John Hudson · April 2013

Thomas, how do you approach subsetting of unencoded glyphs, e.g. smallcaps, ligatures, stylistic variants? Are you parsing GSUB tables to track glyphs that map back to character level subset inclusions, or are you relying on glyph name parsing?

Deleted Account · April 2013

John, the "WRIT project" link is not working for me.
