Determining language support

Jack Jennings · September 2014

I'm working on a Python library that attempts to report on language support given a character set.

From an amateur perspective, it seems like language support is composed of:

Supplying a certain set of glyphs that comprise the necessary letters for typesetting
Drawing localized versions of glyph shapes (I'm thinking mainly of Cyrillic here)

My library focuses on the former aspect (as it's more easily detected and isn't file-format dependent). I've gotten to the point that I want to fill in some built-in character sets, and have a couple of rather pedantic questions to float to those of you who actually curate language support lists.

1) Assuming language support is defined by what can be typeset, should punctuation be considered a required component? For instance: the various quotation-mark conventions for languages using the Latin alphabet, CJK punctuation, etc.
2) For languages that only use certain letters for loan words, are these included? For example: does English support include /é for résumé, etc.? Does Italian require /x or /j, though they are not part of the base alphabet (probably moot since both are part of ASCII)?

I'd also be happy to hear how anyone else addresses this and, realizing that it may be quite subjective and more nuanced than I imagine, what other issues may come up in developing a programmatic interface.

Thomas Phinney · September 2014

1) Yes.
2) Depends on how fundamental the usage is. For example, in English you can leave off the acute accents on résumé, and those accented letters shouldn't be marked as required.
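As a rough illustration of those two answers, a check could treat punctuation as required and keep loan-word letters in a separate optional set that is reported but never fails the test. This is only a sketch: the character sets and the `check_support` name are hypothetical, made up for this example rather than taken from any of the projects or data files mentioned in this thread.

```python
# Hypothetical, deliberately small character sets (assumptions for illustration).
ENGLISH_REQUIRED = set(
    "abcdefghijklmnopqrstuvwxyz"
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    ".,;:!?'\""
    "\u2018\u2019\u201c\u201d"  # curly single and double quotation marks
)
# Loan-word letters such as the e-acute in "resume" with accents: nice to have.
ENGLISH_OPTIONAL = {"\u00e9", "\u00c9"}


def check_support(font_chars, required, optional=frozenset()):
    """Return (supported, missing_required, missing_optional).

    A font counts as supporting the language when nothing in `required`
    is missing; gaps in `optional` are reported but do not fail the check.
    """
    font_chars = set(font_chars)
    missing_required = set(required) - font_chars
    missing_optional = set(optional) - font_chars
    return (not missing_required, missing_required, missing_optional)
```

A font that covers all of `ENGLISH_REQUIRED` but has no é would still be reported as supporting English, with the accented letters listed as missing optional characters.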

Also, you should be aware that there are several open source projects that already tackle this exact same question.

Fontaine was first released in March 2009. The currently developed fork is PyFontaine: https://github.com/davelab6/pyfontaine

Speakeasy came out in October 2010. https://github.com/typekit/speakeasy

Neither had all that impressive a data collection at first, so I drew on both those sources and added a LOT more data when I was at Extensis. Extensis released that data in April 2013; download it here: https://github.com/Extensis/lang/blob/master/languages.xml

It is not perfect, but it covers over 150 languages and character sets. Extensis uses the data internally with a proprietary scanning tool. It has been integrated back into PyFontaine.

Jack Jennings · September 2014

Good resources, all. I assumed that some sort of system for this had already been created, but didn't turn up any OSS in my cursory search. I'm glad to know that this data already exists out there, and I'll look into leveraging it.

Kent Lew · September 2014

Thomas — I apologize in advance if this is described somewhere obvious, but what are the distinctions “scanning-codepoints” vs “subsetting-codepoints” in the Extensis xml data?

Dave Crossland · September 2014

https://github.com/davelab6/pyfontaine consumes https://github.com/Extensis/lang/blob/master/languages.xml

Thomas Phinney · September 2014

Kent: I developed the Extensis data for two purposes, “scanning” fonts to determine what languages they support, and “subsetting” fonts to support a given language. The subsetting codepoints automatically include all the scanning codepoints, plus any additional ones specified.

Why is this? Well, after much thought and wrestling with character sets early on, I decided to make a distinction between the two degrees of support. There are characters that we would keep in the font if present when subsetting, but that we would not require to report a font as supporting the language in question.

A shining example of this would be any newly-added currency symbol. So, for example, you don't want to require the brand-new Turkish lira symbol to be present to say that a font supports Turkish, but you surely do not want to remove that symbol when subsetting a font to “just Turkish,” either!
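The relationship between the two lists can be sketched roughly like this. It is an illustration only: the dictionary layout, the `supports` and `subset` helpers, and the heavily abbreviated Turkish letter list are assumptions, not the structure of the Extensis XML or the behavior of the proprietary scanning tool.

```python
TURKISH_LIRA = 0x20BA  # U+20BA TURKISH LIRA SIGN

# Hypothetical, abbreviated entry: a real list would also carry the basic
# Latin letters, digits, punctuation, and so on.
TURKISH = {
    # Needed before a font is reported as supporting the language.
    "scanning": {
        0x00E7, 0x011F, 0x0131, 0x00F6, 0x015F, 0x00FC,  # c-cedilla, g-breve, dotless i, o-diaeresis, s-cedilla, u-diaeresis
        0x00C7, 0x011E, 0x0130, 0x00D6, 0x015E, 0x00DC,  # their uppercase forms
    },
    # Kept when subsetting if present, but never required when scanning.
    "subsetting_extra": {TURKISH_LIRA},
}


def supports(font_codepoints, lang):
    """True when every scanning codepoint is present in the font."""
    return lang["scanning"] <= set(font_codepoints)


def subset(font_codepoints, lang):
    """Codepoints to keep: scanning plus extras, limited to what the font has."""
    keep = lang["scanning"] | lang["subsetting_extra"]
    return set(font_codepoints) & keep
```

With this split, a font lacking the lira sign still passes `supports`, while a font that has it keeps it through `subset`.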

Kent Lew · September 2014

Ah, makes sense! I have essentially been making the same distinctions myself; I just didn’t connect those labels with those divisions.

Thanks for clarifying.

Thomas Phinney · September 2014

Also, note that DTL OTMaster has some nice language-reporting capabilities.
