I'm working on a Python library that attempts to report on language support given a character set.
From an amateur perspective, it seems like language support is composed of:
My library focuses on the former aspect (as it's more easily detected and isn't file-format dependent). I've gotten to the point that I want to fill in some built-in character sets, and have a couple of rather pedantic questions to float to those of you who actually curate language support lists.
I'd also be happy to hear how anyone else addresses this and, realizing that it may be more subjective and nuanced than I imagine, what other issues may come up in developing a programmatic interface.
Comments
2) Depends on how fundamental the usage is. For example, in English you can leave off the acute accents on résumé, and those accented letters shouldn't be marked as required.
Also, you should be aware that there are several open source projects that already tackle this exact same question.
Fontaine was first released in March 2009. The currently developed fork is PyFontaine: https://github.com/davelab6/pyfontaine
Speakeasy came out in October 2010. https://github.com/typekit/speakeasy
Neither had an especially impressive data collection at first, so I drew on both of those sources and added a LOT more data when I was at Extensis. Extensis released that data in April 2013; you can download it here: https://github.com/Extensis/lang/blob/master/languages.xml
It is not perfect, but it covers over 150 languages and character sets. Extensis uses the data internally with a proprietary scanning tool. It has been integrated back into PyFontaine.
Why is this? Well, after much thought and wrestling with character sets early on, I decided to make a distinction between the two degrees of support. There are characters that we would keep in the font if present when subsetting, but that we would not require to report a font as supporting the language in question.
A shining example of this would be any newly-added currency symbol. So, for example, you don't want to require the brand-new Turkish lira symbol to be present to say that a font supports Turkish, but you surely do not want to remove that symbol when subsetting a font to “just Turkish,” either!
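To make the two degrees of support concrete, here is a minimal Python sketch of how the distinction could be modeled. It is not how the Extensis data, PyFontaine, or Speakeasy are actually structured; the `LanguageCharset` class, its `required` and `optional` fields, and both method names are hypothetical, and the Turkish character list is a toy example rather than curated data.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class LanguageCharset:
    """Hypothetical container for one language's character data."""
    name: str
    required: frozenset  # must all be present to report support
    optional: frozenset = field(default_factory=frozenset)  # kept when subsetting

    def is_supported_by(self, font_chars):
        # Only the required characters decide whether support is reported.
        return self.required <= set(font_chars)

    def subset(self, font_chars):
        # When subsetting, keep required AND optional characters the font has.
        return set(font_chars) & (self.required | self.optional)


# Toy data for illustration only (lowercase letters, no punctuation or digits).
turkish = LanguageCharset(
    name="Turkish (toy example)",
    required=frozenset("abcçdefgğhıijklmnoöprsştuüvyz"),
    optional=frozenset("\u20ba"),  # U+20BA TURKISH LIRA SIGN
)

# An older font without the lira sign, and a newer one that includes it.
old_font = set("abcçdefgğhıijklmnoöprsştuüvyzqwx")
new_font = old_font | {"\u20ba"}

print(turkish.is_supported_by(old_font))    # lira sign is not required
print("\u20ba" in turkish.subset(new_font))  # but it survives subsetting
print("q" in turkish.subset(new_font))       # unrelated letters are dropped
```

Keeping the two sets separate means the support check and the subsetting step can each use the set that suits it, instead of forcing one list to serve both purposes.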
Thanks for clarifying.