New Member Here...
Tocharian is part of the large Indo-European language family that, of course, includes English. Written evidence, mostly translations of Buddhist liturgical works, was found in the Tarim basin of Western China early in the 20th century. I have designed a font of Tocharian A. It is approximately 3000 glyphs in size! The large number of glyphs is not due to the presence of logograms like those of Chinese or Maya Hieroglyphs, but to the unique alphabetic syllabary, an abugida, and the way its signs are stacked into phonetic units. I have included three attachments to this post. 1: A screenshot of one of 14 (so far…) FontLab files. 2: An example of a typical Tocharian Buddhist document. 3: A page from a Roman / Unicode look-up table showing how the glyphs are distributed among the many FontLab files.
This is a font project that requires application design solutions rather than font aesthetics: specifically, bringing an ancient script into the digital age. I do not have the programming expertise to create a simple application to drop glyphs into a Word doc (or some such...) to create a mixed-font page for academic or pedagogical needs.
I don't know anything about coding or word processing protocol, but just enough font design to get myself into trouble.
I am looking for collaborators, but failing that, any advice, tips, criticism, books, or leads to design groups that could provide me a road map on how to move this project forward.
Sláinte
Charles
Comments
First up: Wow. This is an impressive piece of work.
Now, in terms of road-maps for the future, the first step to anything is to get Tocharian encoded in Unicode. What you've done currently is to map each Tocharian character onto a Latin character, and because there are so many characters in Tocharian, you've had to map different characters to the same Latin character in multiple fonts. And I'm sure you can see the problems with that already; if someone chooses the wrong font, they get the wrong glyph even though the underlying document has not changed. (In fact, if they choose a completely different font, they get Latin!) To make the document interchangeable and not font-dependent, you need a way to uniquely address each Tocharian glyph, without "doubling up" on other codepoints. This will come when Unicode encodes Tocharian: each character in the abugida will get its own codepoint allocation. There is a proposal to encode Tocharian already, but it was last visited in 2015 and I'm not sure of the status. I can say for sure that it is not currently being discussed; what I don't know is why not. I will try to find out. (The reason I am interested is, of course, that once Tocharian does get encoded, I will need a font for it in the Noto project, and you've just drawn one...)
Once Tocharian is encoded, a lot of other things get easier. Keyboards can be implemented. Shaping engines get to know the rules for putting glyphs together. But as you can see, that's not going to happen any day soon. Encoding new scripts takes a long time, and goes through a big process of discussion and revision, and that's when a proposal is "active".
From what I can see, you are trying to encode every single conjunct - i.e. to type a kya, you look up what character maps to "kya" atomically, and then type that character in the given font. This gives you a lot of mappings, and it is why you are needing to split the glyph set into multiple fonts. This is not how the script will work when Tocharian is encoded in Unicode. Instead, it will probably work pretty much like any other Brahmic script; i.e. to type a kya, you would type a "k" on your keyboard, the keyboard software would add a codepoint U+11E10 into your document and the font would display a KA, and then you would type something like "f" (which is the usual virama key in Brahmic scripts) and U+11E4F would be added to your document and you would see a visible virama, and then you would type a "y" (U+11E29 for YA), and then the OpenType coding within the font would use OpenType substitution to turn the whole set of three characters into the glyph representing the kya conjunct. (I am using codepoint numbers from the proposal, which is a bit naughty because although they have been reserved for Tocharian, the actual codepoints have not been formally decided yet; this is just for the sake of example.)
This is something you can already do in your "hacked" Latin font. Not every glyph needs to be assigned to a codepoint; a glyph without one can still be reached through substitutions in the same way. So you can assign "k" to KA, "f" to virama, "y" to YA, and not have a codepoint assignment to the "kya" conjunct directly, but instead have a rule inside your font that says "sub k f y by kya;"
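To make the idea concrete, here is a toy Python model of what a rule like "sub k f y by kya;" does to a run of glyphs. It is only a sketch: the second rule and the function name are made up for illustration, and a real shaping engine matches rules in the order the compiled lookup dictates rather than by this simple longest-first scan.

```python
# Toy model of an OpenType ligature substitution such as "sub k f y by kya;".
# The rule table is illustrative only; "kka" is a hypothetical second rule.
RULES = {
    ("k", "f", "y"): "kya",  # KA + virama + YA -> kya conjunct glyph
    ("k", "f", "k"): "kka",  # KA + virama + KA -> kka conjunct glyph
}


def apply_ligatures(glyphs, rules=RULES):
    """Replace the longest matching glyph sequence at each position."""
    longest = max(len(seq) for seq in rules)
    out, i = [], 0
    while i < len(glyphs):
        # Try the longest possible sequence first, down to two glyphs.
        for n in range(min(longest, len(glyphs) - i), 1, -1):
            seq = tuple(glyphs[i:i + n])
            if seq in rules:
                out.append(rules[seq])
                i += n
                break
        else:
            # No rule matched here: the glyph passes through unchanged.
            out.append(glyphs[i])
            i += 1
    return out


print(apply_ligatures(["k", "f", "y", "k"]))
```

The point is that only "k", "f" and "y" need codepoints; "kya" exists in the font as a glyph that is reachable only through the rule.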
A second technical thing to think about is whether you actually need "precomposed" conjunct glyphs for every possible conjunct. I think it should be possible for you to dramatically reduce your glyphset. OpenType has the ability to "attach" two glyphs together at anchor points. So for example, kssa is a ka with a subscript-ssa attached to the bottom of it, kssta is a ka with a subscript-ssa attached to it and a subscript-ta attached to that, and so on. So you don't actually need to draw out all the conjuncts; I know some are irregular and will need drawing atomically as precomposed glyphs, but some are regular and can be formed by attachment. You just draw the subscript-ssa and subscript-tta glyphs, and these two can be used to attach to all the base consonant glyphs to form first-level conjuncts, and also to attach to each other to form deeper conjunct stacks. (This conjunct technique is used well in scripts like Myanmar and Javanese, but not so much in Devanagari or other Indics. I'm not sure why not. Maybe there are lots of irregularities; maybe designers really like having control over what goes where; or maybe foundries are just paid per glyph... You would have to examine it and see to what extent it would work in Tocharian.) The same may be true for some of your vowel forms as well; where you're just adding strokes to a base character, you can use mark attachment for this.
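The arithmetic behind anchor attachment is simple, and a small Python sketch may help. Each glyph carries an anchor where the next subscript hangs, and each subscript carries the point that must land on that anchor. All glyph names and coordinates below are invented for illustration; in a real font the anchors are drawn in the font editor and the layout engine does this calculation for you.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

Point = Tuple[int, int]


@dataclass(frozen=True)
class Glyph:
    name: str
    bottom: Point                   # anchor where a subscript hangs below this glyph
    attach: Optional[Point] = None  # subscripts only: point that lands on that anchor


def stack(base, *subscripts):
    """Return (glyph name, offset) pairs for a base with stacked subscripts."""
    placements = [(base.name, (0, 0))]
    ax, ay = base.bottom  # current anchor, in the base glyph's coordinates
    for sub in subscripts:
        # Shift the subscript so its attach point coincides with the anchor.
        dx, dy = ax - sub.attach[0], ay - sub.attach[1]
        placements.append((sub.name, (dx, dy)))
        # The next subscript hangs from this subscript's own bottom anchor.
        ax, ay = dx + sub.bottom[0], dy + sub.bottom[1]
    return placements


# Hypothetical glyphs and anchor coordinates, in font units.
KA = Glyph("ka", bottom=(300, 0))
SUB_SSA = Glyph("ssa.sub", bottom=(250, 0), attach=(250, 500))
SUB_TA = Glyph("ta.sub", bottom=(200, 0), attach=(200, 450))

print(stack(KA, SUB_SSA))          # kssa
print(stack(KA, SUB_SSA, SUB_TA))  # kssta
```

With three drawn glyphs this covers ka, kssa and kssta; every additional base consonant with a bottom anchor gets the same subscripts for free, which is where the reduction in glyph count would come from.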
Basically my advice now would be to study how other similar Brahmic and Indic fonts are put together. I would say that my book would be helpful for this, but I think it doesn't really have as much about Indic as it should have. The various Microsoft script development standards for Indic scripts might be useful, but you would have to think laterally about how they might be applied to Tocharian. And of course, there's always a lot of good information in the Unicode proposal.
Good luck, and please feel free to send me any specific questions.
Thanks for your comment! I appreciate any leads. Nearly all grammatical material is in transliteration with a manuscript or individual aksara photo tacked on for emphasis, although the corpus of manuscripts is available at high enough resolution from CEToM or Berlin/TITUS.
Sláinte
Charles
Thanks again for the post. I found my way to the Noto Font Foundry. Wow! Definitely a “Behind the Green Door” moment for me. Same as when I was a kid and found Dr. Seuss’ “On Beyond Zebra”.
Let me ask a few questions / make comments, from “Big Picture” to the nitty-gritty. My observation is that for designing a font and application (…this particularly esoteric one under discussion) there is a pre-existing, well-trod, organizational paradigm. Proposals are submitted, Unicode standards are voted on and approved. Project “Managers” (yourself and Michael Everson, so far, already mentioned in the thread) determine priorities. Dr. Hannes Fellner, a Tocharian scholar specializing in epigraphy from U. Vienna, has expressed some limited interest in my project, so an academic group is required for their seal of approval. (…there will be devils in the details) Money is raised and spent on coders/programmers. Another script is added to the bullpen of available tools.
Can I be part of all this? Wait for sponsorship or forge ahead by myself?
My goal was to make a practical system to drop glyphs into a Word doc. I’m a bit out of my league.
Here is an example of the type of questions I would ask if I were to forge ahead on this project by myself.
The 14+ separate FontLab files were a practical solution for me to keep track of all the glyphs I created, but were a poor decision in hindsight for application design. What needs to be done is to create a single 3500 (or more) glyph, Unicode-compliant Tocharian file. Is Lee Wilson’s Unicode proposal workable? OpenType features allow ligatures. This script is rich in ligatures and nearly all of them are irregular. What is the best way to create this substitution list? OpenType tools as part of FontLab? Python programming? Something else? This may come across as a request for hand-holding, but if I were a design student, how would you advise me?
I sincerely thank all of you for taking time from your busy schedule to correspond with me.
Sláinte
Charles
Dear Mikhail:
Seems to me, bitmapped images and text, together, in a technical document would be “ugly” for lack of a better word. I do not understand the term “political” in this context.
Sláinte
Charles
"So that is why before giving advises I'd ask what is the general plan and how far it goes."...good question.
My goal was to make a practical system to drop glyphs into a Word doc. Step 1 was to design the font, sieving for all aksara combinations in published Tocharian texts. I define a "practical system" in terms of a user-friendly look-up table as well as the standard Devanāgarī keyboard key-stroke sequences that are commonly used for non-Latin fonts. Tocharian has traditionally been transliteration only. There is some linguistic material that I want to re-edit w/ epigraphy included from the very beginning. I am looking for collaborators or, failing that, specific knowledge to solve problems as they arise. Problem 1 is how to facilitate large numbers of ligatures in FontLab5. Unicode standardization would be icing, decorations and cherry on the cake.
Sláinte
Charles
Hey Mikhail:
Tocharian is only transliterated. The (mostly) accepted Latin transliteration scheme is attachment 4.
The Latin table is a glyph-to-Unicode table that tells me where glyphs are located, so that I can use previously designed glyphs to construct new glyphs. The table grows as I sieve through new transliterated manuscripts.
A “widget” based on alphabetization was how I envisioned input. For example: l + l would display 10 ligaturen from the list, all ligaturen that began w/ ll- . l + l + y would display just llya, llyā, llyām simultaneously, with just 3 keystrokes (as there are no llyi, llye, etc. in the ligaturen list so far). No knowledge of keyboard sequence rules is needed, just a keyboard mapping of “a” to a, “ā” to A, and “ä” to @ and the widget. Selecting the aksara would drop the character into the document.
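The core of such a widget is a prefix filter over the list of transliterated aksaras. A minimal Python sketch, assuming a tiny sample inventory (the real list would come from the look-up table, and the names `AKSARAS`, `KEYMAP` and `candidates` are made up here):

```python
# Hypothetical sample of transliterated aksaras; the real inventory
# would be loaded from the look-up table.
AKSARAS = ["ka", "kya", "lla", "llā", "lle", "llya", "llyā", "llyām", "lya"]

# Keys typed on a plain English keyboard that stand for diacritic letters,
# following the mapping described above: A -> ā, @ -> ä.
KEYMAP = {"A": "ā", "@": "ä"}


def normalize(keys):
    """Turn raw keystrokes into transliteration letters."""
    return "".join(KEYMAP.get(k, k) for k in keys)


def candidates(keys, inventory=AKSARAS):
    """Return every aksara whose transliteration starts with what was typed."""
    prefix = normalize(keys)
    return [a for a in inventory if a.startswith(prefix)]


print(candidates("ll"))
print(candidates("lly"))
print(candidates("llyA"))
```

Everything beyond this (showing the glyphs, inserting the chosen one into the document) is user-interface work on top of the same lookup.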
I never knew the visual pinyin input to a Chinese font was called a “helper widget” even though I have used this many times. That one comes with a tiny bitmapped character to assist selection. A nice feature.
Should I be posting on the “FontLab5 for Beginners” thread rather than here? Unicode for Tocharian is not part of the FontLab5 drop-down Unicode Mode, and Unicode.org does not have Tocharian, only the provisional code chart in Lee Wilson’s proposal. So how does one set up the single, 3500 glyph font? Perhaps something about OpenType substitution tables in FontLab5 might help.
“…But ideally you'll need some coding experience or find assistance.”…a truer statement was never typed into a keyboard.
Sláinte
Charles
Don’t think of Unicode standardised encoding as ‘icing, decorations and cherry on the cake’. A standardised text encoding model is foundational, and without it anything you come up with is only going to be a stop-gap hack to get little pictures of text elements into a document.
That said, preparing, submitting, and awaiting approval of a Unicode encoding proposal takes time, but there are ways you can come up with a working hack in the meantime that can serve as a proof-of-concept for the proposed encoding model. This means thinking about how to get text elements into documents in a way that mirrors the encoding model, whether at the font level or the input level. This typically also has the benefit of making it easier to subsequently convert documents created using the hack to an eventual standard encoding.
I recommend against hijacking Latin codepoints for your input, since this will always create ambiguities at the document level that get in the way of eventual conversion. Since Tocharian is a left-to-right script, you can fairly easily use codepoints from Unicode’s Private Use Area (PUA).
You have a couple of options in how to proceed:
a) Try to model an eventual Unicode encoding model as directly as possible, possibly using the 2015 preliminary proposal as a basis. In this approach, you would use PUA codepoints only for the characters needed for a plain text level encoding, and code OpenType Layout features (under the DFLT script tag) to perform glyph substitutions (and possibly positioning) for the glyph variants, ligatures, vowel sign attachments, etc.
An issue to bear in mind in this would be the repha treatment of a conjunct-initial R–, which as in most Brahmi-derived scripts takes a post-base (superscript in Tocharian) form, that in a standardised OpenType Indic model would involve re-ordering by the shaping engine. Since a PUA-encoded proof-of-concept hack does not have access to shaping engine support, it would be a challenge to get the repha to behave correctly.
What you end up with in this approach is something that very closely approximates an anticipated standardised encoding, and eventual document conversion can be as simple as swapping individual PUA codepoints in the text for individual Unicode Tocharian codepoints.
b) Use PUA to effect a ‘glyph encoding’, i.e. assign a PUA codepoint to every glyph in the font, and then use a custom keyboard driver (e.g. in the open source Keyman framework, although you can also create installable Windows and Mac OS keyboards) that maps from sequences of letters and formatting characters to conjunct ligatures and other glyph variants.
This approach has the benefit of being fairly robust in terms of not relying on active support for OpenType Layout for PUA characters, and producing documents that will render identically, using the PUA encoded font, in different environments. It also facilitates visual order input, which might be helpful in getting repha to work.
What you end up with in this approach is documents that are the digital equivalent of hand-set metal type: there is a one-to-one mapping from PUA codepoints to each graphical unit of the visible text, without any mapping to characters as they would exist in a clean text encoding. This means that document conversion will require constructing such a mapping, which can be complicated if converting from visual order to phonetic order.
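For option (a), the eventual document conversion really can be a one-to-one codepoint swap. Here is a minimal Python sketch of that step. It assumes the PUA codepoints were assigned at a fixed offset from the provisional codepoints in the 2015 proposal; the offset 0x3E00 and the block range are illustrative assumptions only, since the final codepoints have not been decided.

```python
# Assumption for illustration: each PUA codepoint sits 0x3E00 below the
# provisional codepoint in the proposal (e.g. U+E010 <-> U+11E10).
OFFSET = 0x3E00

# Hypothetical extent of the PUA block used for Tocharian characters.
PUA_START, PUA_END = 0xE000, 0xE05F

# Translation table: PUA codepoint -> provisional Tocharian codepoint.
PUA_TO_TOCHARIAN = {cp: cp + OFFSET for cp in range(PUA_START, PUA_END + 1)}


def pua_to_unicode(text):
    """Swap PUA codepoints for Tocharian ones, leaving other text untouched."""
    return text.translate(PUA_TO_TOCHARIAN)


sample = "kya: \uE010\uE04F\uE029"
converted = pua_to_unicode(sample)
print([f"U+{ord(ch):04X}" for ch in converted])
```

Option (b) has no equivalent shortcut: each glyph-level PUA codepoint has to be mapped back to the sequence of characters it stands for, and visual order has to be turned back into phonetic order.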
Sláinte
Charles
The brāhmī repha in Tocharian is just an “r” of the ligaturen written above the line and provides attachment of the vowel (although I don’t think this holds true for Sanskrit loans…). See attachment 5. No cognitive dissonance like when one reads Sanskrit. Good example of where a simple look-up table eliminates the complicated keystroke sequence in a typical Devanāgarī keyboard.
Hey John S:
If you look at attachment 1 earlier in the thread there is a snip of a Latin-Unicode table (attachment 4 also). Characters in bold and underlined are transliterations of the Fremdzeichen, e.g. –nt or pä. Just like “a” to a, “ā” to A, and “ä” to @ and the widget mentioned above, “t” could become T and “p” could become P, advocating a look-up table with a normal English-based keyboard and a Devanāgarī keyboard approach.
Thanks for brainstorming about this with me.
Sláinte
Charles
The ‘above the line’ aspect may imply a typical Indic reordering behaviour for repha. The decision will be around whether a) conjunct-initial R– behaves differently from other conjunct-initial letters in terms of shaping and b) that shaping suggests the possibility of rendering the repha using a combining mark, in which case post-base reordering is likely.
Any new Unicode script with Brahmic shaping behaviours is going to be passed to the Universal Shaping Engine, which means it will be subject to a standardised shaping model derived from the Unicode Indic properties assigned to the characters. This will mean that decisions about how conjunct-initial R– is handled will be ultimately made at the font level, expressed in terms of whether the <rphf> Reph Forms OTL feature is used or not. If <rphf> is used, then the output glyph from that feature will be reordered to the end of the cluster automatically by USE. If the font instead treats R– like any other conjunct-initial letter, then it will not be reordered.
So, yes:
- Use PUA codepoints, and use them in a way that mirrors the codepoint inventory of the Unicode proposal.
- Use a Devanagari-style input mechanism, perhaps making your own keyboard using Keyman or similar, which produces those PUA codepoints.
- Because the script isn't encoded, you can't expect any help from the Universal Shaping Engine today; any re-orderings etc. will need to be done inside the font, which might be a pain.
If I were doing this, I might try a "clever-stupid" approach; instead of trying to do it like an Indic font (which is what you would do if you had support from the shaping engine), just work out what combinations of glyphs produce a conjunct and turn your current lookup tables into a big set of substitution rules. What I mean is:
- You hit the "k" key on your Keyman keyboard and it produces the codepoint E010 (11E10 from the proposal; subtract 0x3E00 from the codepoints in the proposal to re-root them into the private use area.)
- In your font, E010 is mapped to a glyph called "ka-tocharian".
- Similarly the "j" key gives you E04F, which is mapped to "virama-tocharian".
- In the calt feature (or similar) of your font, you have a huge set of rules like: "sub ka-tocharian i-tocharian by ki-tocharian; sub ka-tocharian virama-tocharian ka-tocharian by kka-tocharian; sub ka-tocharian virama-tocharian sa-tocharian by ksa-tocharian; ..."
This way you will get a system which you can use today, and which feels similar to the "ideal" system you will get once Tocharian is encoded in Unicode. Once that happens, you can update the codepoint mappings in Keyman and in your font, rewrite the rules to make more use of the help provided by the USE, and you've got a Unicode-compliant Tocharian font.
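With thousands of conjuncts, you would not want to type that rule list by hand. Here is a rough Python sketch of generating it from a lookup table, along with the codepoint re-rooting described above. The `CONJUNCTS` table and function names are hypothetical; the glyph names follow the examples in the list, and the sort puts longer sequences first as a precaution so that a short rule cannot swallow the start of a longer one.

```python
# Re-root provisional codepoints from the proposal into the Private Use Area.
PROPOSAL_TO_PUA_OFFSET = 0x3E00


def to_pua(proposal_codepoint):
    """E.g. 0x11E10 (KA in the proposal) -> 0xE010."""
    return proposal_codepoint - PROPOSAL_TO_PUA_OFFSET


# Hypothetical lookup table: conjunct glyph name -> the glyph sequence typed.
CONJUNCTS = {
    "ki-tocharian": ["ka-tocharian", "i-tocharian"],
    "kka-tocharian": ["ka-tocharian", "virama-tocharian", "ka-tocharian"],
    "ksa-tocharian": ["ka-tocharian", "virama-tocharian", "sa-tocharian"],
}


def feature_code(conjuncts, tag="calt"):
    """Build OpenType feature-file text with one 'sub ... by ...;' per conjunct."""
    # Longest sequences first; sorted() is stable, so ties keep table order.
    rules = sorted(conjuncts.items(), key=lambda item: -len(item[1]))
    lines = [f"feature {tag} {{"]
    lines += [f"    sub {' '.join(seq)} by {name};" for name, seq in rules]
    lines.append(f"}} {tag};")
    return "\n".join(lines)


print(hex(to_pua(0x11E10)))
print(feature_code(CONJUNCTS))
```

The printed text can be pasted into the font's OpenType feature code; when the real table grows, only the `CONJUNCTS` data changes.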