Tocharian A Font and Application

Charles · January 2023

New Member Here...

Tocharian is part of the large Indo-European language family that, of course, includes English. Written evidence, mostly translations of Buddhist liturgical works, was found in the Tarim Basin of western China early in the 20th century. I have designed a font of Tocharian A. It is approximately 3000 glyphs in size! The large number of glyphs is not due to the presence of logograms like those of Chinese or Maya hieroglyphs, but to the unique alphabetic syllabary, an abugida, and the way its signs are stacked into phonetic units. I have included three attachments to this post: 1: A screenshot of one of 14 (so far…) FontLab files. 2: An example of a typical Tocharian Buddhist document. 3: A page from a Roman / Unicode look-up table showing how the glyphs are distributed among the many FontLab files.

This font project requires application design solutions rather than font aesthetics: specifically, bringing an ancient script into the digital age. I do not have the programming expertise to create a simple application to drop glyphs into a Word doc (or some such...) to create a mixed-font page for academic or pedagogical needs.

I don't know anything about coding or word processing protocol, but just enough font design to get myself into trouble.

I am looking for collaborators, but failing that, any advice, tips, criticism, books, or leads to design groups that could provide me a road map on how to move this project forward.

Sláinte

Charles

Simon Cozens · January 2023

Hi Charles! Just to introduce myself, I'm a font engineer working with Google on the Noto project, which aims to produce a font for every writing system in Unicode, living and historical. So what you say about bringing ancient scripts into the digital age resonates with me.

First up: Wow. This is an impressive piece of work.

Now, in terms of road-maps for the future, the first step to anything is to get Tocharian encoded in Unicode. What you've done currently is to map each Tocharian character onto a Latin character, and because there are so many characters in Tocharian, you've had to map different characters to the same Latin character in multiple fonts. And I'm sure you can see the problems with that already; if someone chooses the wrong font, they get the wrong glyph even though the underlying document has not changed. (In fact, if they choose a completely different font, they get Latin!) To make the document interchangeable and not font-dependent, you need a way to uniquely address each Tocharian glyph, without "doubling up" on other codepoints. This will come when Unicode encodes Tocharian: each character in the abugida will get its own codepoint allocation. There is a proposal to encode Tocharian already, but it was last visited in 2015 and I'm not sure of the status. I can say for sure that it is not currently being discussed; what I don't know is why not. I will try and find out. (The reason I am interested is, of course, that once Tocharian does get encoded, I will need a font for it in the Noto project, and you've just drawn one...)

Once Tocharian is encoded, a lot of other things get easier. Keyboards can be implemented. Shaping engines get to know the rules for putting glyphs together. But as you can see, that's not going to happen any day soon. Encoding new scripts takes a long time, and goes through a big process of discussion and revision, and that's when a proposal is "active".

From what I can see, you are trying to encode every single conjunct - i.e. to type a kya, you look up what character maps to "kya" atomically, and then type that character in the given font. This gives you a lot of mappings, and it is why you are needing to split the glyph set into multiple fonts. This is not how the script will work when Tocharian is encoded in Unicode. Instead, it will probably work pretty much like any other Brahmic script; i.e. to type a kya, you would type a "k" on your keyboard, the keyboard software would add a codepoint U+11E10 into your document and the font would display a KA, and then you would type something like "f" (which is the usual virama key in Brahmic scripts) and U+11E4F would be added to your document and you would see a visible virama, and then you would type a "y" (U+11E29 for YA), and then the OpenType coding within the font would use OpenType substitution to turn the whole set of three characters into the glyph representing the kya conjunct. (I am using codepoint numbers from the proposal, which is a bit naughty because although they have been reserved for Tocharian, the actual codepoints have not been formally decided yet; this is just for the sake of example.)

This is something you can already do in your "hacked" Latin font. Not every glyph needs to be assigned to a codepoint, but it can be addressed through substitutions in the same way. So you can assign "k" to KA, "f" to virama, "y" to YA, and not have a codepoint assignment to the "kya" conjunct directly, but instead have a rule inside your font that says "sub k f y by kya;"
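To make the idea concrete, here is a minimal Python sketch that imitates what a ligature rule such as "sub k f y by kya;" does to a run of glyphs. It is only a toy model of the behaviour, not how a shaping engine is actually implemented; the rule table and the apply_ligatures helper are made up for illustration.

```python
# Toy model of OpenType ligature substitution, for illustration only.
# Glyph names and rules are hypothetical examples, not taken from a real font.

LIGATURES = {
    ("k", "f", "y"): "kya",   # KA + virama + YA -> precomposed kya conjunct
    ("k", "f", "k"): "kka",   # KA + virama + KA -> precomposed kka conjunct
}

def apply_ligatures(glyphs, rules=LIGATURES):
    """Replace the longest matching glyph sequence at each position."""
    longest = max(len(seq) for seq in rules)
    out, i = [], 0
    while i < len(glyphs):
        for n in range(longest, 1, -1):          # try longer matches first
            seq = tuple(glyphs[i:i + n])
            if seq in rules:
                out.append(rules[seq])
                i += n
                break
        else:                                     # no rule matched here
            out.append(glyphs[i])
            i += 1
    return out

print(apply_ligatures(["k", "f", "y", "k"]))      # ['kya', 'k']
```

The document still stores the three typed characters; only the displayed glyph changes, which is what keeps the text searchable and font-independent.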

A second technical thing to think about is whether you actually need "precomposed" conjunct glyphs for every possible conjunct. I think it should be possible for you to dramatically reduce your glyphset. OpenType has the ability to "attach" two glyphs together at anchor points. So for example, kssa is a ka with a subscript-ssa attached to the bottom of it, kssta is a ka with a subscript-ssa attached to it and a subscript-ta attached to that, and so on. So you don't actually need to draw out all the conjuncts; I know some are irregular and will need drawing atomically as precomposed glyphs, but some are regular and can be formed by attachment. You just draw the subscript-ssa and subscript-tta glyphs, and these two can be used to attach to all the base consonant glyphs to form first-level conjuncts, and also to attach to each other to form deeper conjunct stacks. (This conjunct technique is used well in scripts like Myanmar and Javanese, but not so much in Devanagari or other Indics. I'm not sure why not. Maybe there are lots of irregularities; maybe designers really like having control over what goes where; or maybe foundries are just paid per glyph... You would have to examine it and see to what extent it would work in Tocharian.) The same may be true for some of your vowel forms as well; where you're just adding strokes to a base character, you can use mark attachment for this.
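As a rough illustration of what "attachment at anchor points" means, the Python sketch below stacks subscript glyphs under a base by lining up anchors. It is a toy model, not the OpenType positioning tables themselves, and every glyph name and coordinate in it is invented.

```python
# Toy model of anchor-based attachment (the idea behind mark positioning).
# All glyph names and coordinates below are invented for illustration.

BASES = {"ka": {"bottom": (300, 0)}}            # where a subscript hangs from
SUBSCRIPTS = {
    # "top" is the point that snaps onto the glyph above;
    # "bottom" lets a further subscript hang from this one.
    "ssa.sub": {"top": (250, 600), "bottom": (250, 0)},
    "ta.sub":  {"top": (200, 500), "bottom": (200, 0)},
}

def stack(base, subscripts):
    """Return (glyph, x, y) placements for a base with stacked subscripts."""
    placements = [(base, 0, 0)]
    attach_x, attach_y = BASES[base]["bottom"]
    for name in subscripts:
        top_x, top_y = SUBSCRIPTS[name]["top"]
        # Shift the subscript so its "top" anchor lands on the attachment point.
        dx, dy = attach_x - top_x, attach_y - top_y
        placements.append((name, dx, dy))
        bottom_x, bottom_y = SUBSCRIPTS[name]["bottom"]
        attach_x, attach_y = bottom_x + dx, bottom_y + dy
    return placements

print(stack("ka", ["ssa.sub", "ta.sub"]))
```

With this approach two subscript drawings serve every base that has a "bottom" anchor, which is where the reduction in glyph count would come from.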

Basically my advice now would be to study how other similar Brahmic and Indic fonts are put together. I would say that my book would be helpful for this, but I think it doesn't really have as much about Indic as it should have. The various Microsoft script development standards for Indic scripts might be useful, but you would have to think laterally about how they might be applied to Tocharian. And of course, there's always a lot of good information in the Unicode proposal.

Good luck, and please feel free to send me any specific questions.

Peter Baker · January 2023

Wow. We briefly touched on Tocharian long ago in a grad. school linguistics course, but only in transliteration: I've never seen the script before. It's beautiful, and you've done impressive work with it.

I totally agree with Simon about the necessity of getting the script into Unicode. I suspect that Lee Wilson's proposal has languished because of the scale of the undertaking. The great guru of archaic and minority languages for Unicode is Michael Everson, who has the knowledge and the influence to get a proposal through the appropriate committee (or committees). He knows what goes into a Unicode proposal: he's done a great many of them. I'm going to guess that he'd be interested in helping to get Tocharian into the standard. If you message me, I'll send you his email.

Charles · January 2023

Hey Peter:
Thanks for your comment! I appreciate any leads. Nearly all grammatical material is in transliteration with a manuscript or individual aksara photo tacked on for emphasis, although the corpus of manuscripts is available at high enough resolution from CEToM or Berlin/TITUS.

Sláinte
Charles

Charles · January 2023

Thanks again for the post. I found my way to the Noto Font Foundry. Wow! Definitely a “Behind the Green Door” moment for me. Same as when I was a kid and found Dr. Seuss’ “On Beyond Zebra”.

Let me ask a few questions / make comments, from “Big Picture” to the nitty-gritty. My observation is that for designing a font and application (…this particularly esoteric one under discussion) there is a pre-existing, well-trod organizational paradigm. Proposals are submitted, Unicode standards are voted on and approved. Project “Managers” (yourself and Michael Everson, so far, already mentioned in the thread) determine priorities. Dr. Hannes Fellner, a Tocharian scholar specializing in epigraphy from U. Vienna has expressed some limited interest in my project, so an academic group is required for their seal of approval. (…there will be devils in the details) Money is raised and spent on coders/programmers. Another script is added to the bullpen of available tools.

Can I be part of all this? Wait for sponsorship or forge ahead by myself?

My goal was to make a practical system to drop glyphs into a Word doc. I’m a bit out of my league.

Here is an example of the type of questions I would ask, if I were to forge ahead on this project by myself.

The 14+ separate FontLab files were a practical solution for me to keep track of all the glyphs I created, but a poor decision in hindsight for application design. What needs to be done is to create a single 3500 (or more) glyph, Unicode-compliant Tocharian file. Is Lee Wilson’s Unicode proposal workable? OpenType features allow ligatures. This script is rich in ligatures, and nearly all are irregular. What is the best way to create this substitution list? OpenType tools as part of FontLab? Python programming? Something else? This may come across as a request for hand-holding, but if I were a design student, how would you advise me?

I sincerely thank all of you for taking time from your busy schedule to correspond with me.

Sláinte

Charles

Mikhail Vasilev · January 2023

What is the technical goal, in general and in particular? It is not very clear. First, there should be an encoding table for all of the glyphs. Normally it is just a number-to-image correspondence. To compose text from an image database, you need a simple program called a composer. I'd recommend Python in general. Basic examples of bitmap composers can be found on the net. I wrote some bitmap composers for such tasks some time ago.

So if you just need to combine text samples, this should be enough.

But if there are plans for the script to officially have a place in Unicode, that is merely a "political" question.

Charles · January 2023

Dear Mikhail:

Seems to me, bitmapped images and text, together, in a technical document would be “ugly” for lack of a better word. I do not understand the term “political” in this context.

Sláinte

Charles

Mikhail Vasilev · January 2023

Charles said:

Dear Mikhail:
Seems to me, bitmapped images and text, together, in a technical document would be “ugly” for lack of a better word.

With embedded images, visually it would be the same as a vector image if you use a high resolution. It might be that Word supports vector images as well. But anyway, with an embedded image, bitmap or vector, the biggest issue is that it is not directly editable and "searchable" as normal text.

But it might suffice for the task of just presenting a text for publishing.

Otherwise, if you want to type/edit/search directly in Word or any other standard application using this script, there is much more to be done.

So that is why before giving advice I'd ask what is the general plan and how far it goes.

By "political" I meant that if it is to be standardised as part of Unicode, I guess it will be up to a committee to decide whether it will happen or not.

Charles · January 2023

Hey Mikhail:

"So that is why before giving advice I'd ask what is the general plan and how far it goes."...good question.

My goal was to make a practical system to drop glyphs into a Word doc. Step 1 was to design the font, sieving for all aksara combinations in published Tocharian texts. I define a "practical system" in terms of a user-friendly look-up table as well as the standard Devanāgarī keyboard keystroke sequences that are commonly used for non-Latin fonts. Tocharian has traditionally been transliteration only. There is some linguistic material that I want to re-edit w/ epigraphy included from the very beginning. I am looking for collaborators or, failing that, specific knowledge to solve problems as they arise. Problem 1 is how to facilitate large numbers of ligatures in FontLab5. Unicode standardization would be icing, decorations and cherry on the cake.

Sláinte

Charles

Andreas Stötzner · January 2023

Charles said:

… Unicode standardization would be icing, decorations and cherry on the cake.

On the contrary. As it has been said above, a standardised character encoding is the essential precondition for working with this script in the future. If you are serious about it (I assume you are) then this is the first thing you need to go for.

Once there is a complete and solid encoding scheme you can start making useful fonts for that script. If you have such fonts, you can do text work in whatever application. Character input is, practically, another matter; however, there are ways to achieve this (despite us being still chained to 19th century keyboards up to now).

Everything else would be a hack which would be of not much usefulness ‘the next day’.

Andreas Stötzner · January 2023

Simon Cozens said:

… the first step to anything is to get Tocharian encoded in Unicode.

.

Mikhail Vasilev · January 2023

Charles said:

Hey Mikhail:

"So that is why before giving advice I'd ask what is the general plan and how far it goes."...good question.

My goal was to make a practical system to drop glyphs into a Word doc. Step 1 was to design the font, sieving for all aksara combinations in published Tocharian texts. I define a "practical system" in terms of a user-friendly look-up table as well as the standard Devanāgarī keyboard keystroke sequences that are commonly used for non-Latin fonts. Tocharian has traditionally been transliteration only. There is some linguistic material that I want to re-edit w/ epigraphy included from the very beginning. I am looking for collaborators or, failing that, specific knowledge to solve problems as they arise. Problem 1 is how to facilitate large numbers of ligatures in FontLab5. Unicode standardization would be icing, decorations and cherry on the cake.
Sláinte
Charles

Ok I see. I can comment on data input approaches on Windows, since I have some experience here. (Though I am not confident about Devanagari and what approaches it uses.)

So on Windows there are several good possibilities for custom input. But ideally you'll need some coding experience or find assistance.

For a simple approach, I'd use AutoHotkey. It is a powerful tool on top of the Windows API that allows developing input applications. It has a huge, friendly community and its own forum. You can develop GUI apps and windowless apps (just running in the background).

But note that it is Windows-only. On Linux there is no such app AFAIK. For macOS there are some analogues, but I am not a Mac user so can't really say.

So assuming Windows users, I would make a widget with an input field where I type space-separated Latin syllables, for example "ka ku ko", and then have them converted to the desired code-points of your pre-combined glyphs. Then send this generated input, upon a hotkey, to Word or any other application that accepts input sent via the standard SendInput API. That would cover a big range of Windows applications.

That assumes of course that your font includes ALL of the glyphs at separate code-points.
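The conversion step can be sketched in a few lines of Python. This covers only the lookup from typed syllables to codepoints, not the hotkey or the sending of keystrokes to Word; the table contents and codepoint values are placeholders, not the assignments of any real font.

```python
# Hypothetical look-up table: transliterated syllable -> codepoint of the
# pre-combined glyph. The codepoint values are placeholders only.
SYLLABLE_TO_CODEPOINT = {
    "ka": 0xE100,
    "ku": 0xE101,
    "ko": 0xE102,
}

def syllables_to_text(line, table=SYLLABLE_TO_CODEPOINT):
    """Convert space-separated syllables into a string of glyph codepoints."""
    chars = []
    for syllable in line.split():
        if syllable not in table:
            raise KeyError(f"no glyph for syllable {syllable!r}")
        chars.append(chr(table[syllable]))
    return "".join(chars)

text = syllables_to_text("ka ku ko")
print([f"U+{ord(c):04X}" for c in text])   # ['U+E100', 'U+E101', 'U+E102']
```

Raising an error on an unknown syllable is a deliberate choice here: silently skipping it would leave a hole in the text that is hard to spot later.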

A similar approach is actually used by Chinese users: they have a helper widget where they start to type and get a glyph preview window, which allows a faster search for the desired glyphs. So it is like an advanced virtual keyboard.

Mikhail Vasilev · January 2023

I define a "practical system" in terms of a user-friendly look-up table as well as the standard Devanāgarī keyboard keystroke sequences that are commonly used for non-Latin fonts.

And the Latin presentation that you show in the first screenshot, where does it come from? Is this a standard Devanāgarī set? I mean, speaking of user-friendliness and accessibility, I personally would use only standard Latin chars, i.e. those found on the standard US keyboard layout, for that purpose.

Charles · January 2023

Hey Mikhail:

Tocharian is only transliterated. The (mostly) accepted Latin transliteration scheme is attachment 4.

The Latin table is a glyph-to-Unicode table that tells me where glyphs are located, so I can use previously designed glyphs to construct new glyphs. The table grows as I sieve through new transliterated manuscripts.

A “widget” based on alphabetization was how I envisioned input. For example: l + l would display 10 ligaturen from the list, all ligaturen that begin w/ ll-. l + l + y would display just llya, llyā, llyām simultaneously, with just 3 keystrokes (as there are no llyi, llye, etc. in the ligaturen list so far). No knowledge of keyboard sequence rules, just a keyboard mapping of “a” to a, “ā” to A, and “ä” to @, and the widget. Selecting the aksara would drop the character into the document.
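The narrowing-down behaviour described here is essentially a prefix search over the ligature list. A minimal Python sketch, assuming a hypothetical table keyed by the keyboard transliteration ("A" typed for ā) with placeholder codepoints:

```python
# Hypothetical ligature list keyed by typed transliteration; the codepoint
# values are placeholders. "A" stands for ā, as in the keyboard mapping.
LIGATURES = {
    "lla":   0xE200,
    "llA":   0xE201,
    "llya":  0xE202,
    "llyA":  0xE203,
    "llyAm": 0xE204,
    "lka":   0xE205,
}

def candidates(typed, table=LIGATURES):
    """Return the (transliteration, character) pairs that start with `typed`."""
    return [(name, chr(cp)) for name, cp in sorted(table.items())
            if name.startswith(typed)]

# Each extra keystroke shrinks the list the widget has to display.
print(len(candidates("ll")))                     # 5
print([name for name, _ in candidates("lly")])   # ['llyA', 'llyAm', 'llya']
```

A real widget would show the glyph itself next to each candidate and insert the chosen character into the document.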

I never knew the visual pinyin input to a Chinese font was called a “helper widget” even though I have used this many times. That one comes with a tiny bitmapped character to assist selection. A nice feature.

Should I be posting on the “FontLab5 for Beginners” thread rather than here? Unicode for Tocharian is not part of the FontLab5 drop-down Unicode Mode, and Unicode.org does not have Tocharian, only the provisional code chart in Lee Wilson’s proposal. So how does one set up the single, 3500-glyph font? Perhaps something about OpenType substitution tables in FontLab5 might help.

“…But ideally you'll need some coding experience or find assistance.”…a truer statement was never typed into a keyboard.

Sláinte

Charles

John Hudson · January 2023

A few thoughts:

Don’t think of Unicode standardised encoding as ‘icing, decorations and cherry on the cake’. A standardised text encoding model is foundational, and without it anything you come up with is only going to be a stop-gap hack to get little pictures of text elements into a document.

That said, preparing, submitting, and awaiting approval of a Unicode encoding proposal takes time, but there are ways you can come up with a working hack in the meantime that can serve as a proof-of-concept for the proposed encoding model. This means thinking about how to get text elements into documents in a way that mirrors the encoding model, whether at the font level or the input level. This typically also has the benefit of making it easier to subsequently convert documents created using the hack to an eventual standard encoding.

I recommend against hijacking Latin codepoints for your input, since this will always create ambiguities at the document level that get in the way of eventual conversion. Since Tocharian is a left-to-right script, you can fairly easily use codepoints from Unicode’s Private Use Area (PUA).

You have a couple of options in how to proceed:

a) Try to model an eventual Unicode encoding model as directly as possible, possibly using the 2015 preliminary proposal as a basis. In this approach, you would use PUA codepoints only for the characters needed for a plain text level encoding, and code OpenType Layout features (under the DFLT script tag) to perform glyph substitutions (and possibly positioning) for the glyph variants, ligatures, vowel sign attachments, etc.

An issue to bear in mind in this would be the repha treatment of a conjunct-initial R–, which as in most Brahmi-derived scripts takes a post-base (superscript in Tocharian) form, that in a standardised OpenType Indic model would involve re-ordering by the shaping engine. Since a PUA-encoded proof-of-concept hack does not have access to shaping engine support, it would be a challenge to get the repha to behave correctly.

What you end up with in this approach is something that very closely approximates an anticipated standardised encoding, and eventual document conversion can be as simple as swapping individual PUA codepoints in the text for individual Unicode Tocharian codepoints.
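As a sketch of how simple that conversion could be, the Python below swaps PUA characters one-for-one for target codepoints. Both ranges are assumptions: the PUA block is an arbitrary choice, and the target block is only a placeholder based on the preliminary proposal, since no Tocharian codepoints have been formally assigned.

```python
# Option (a): every PUA codepoint stands for exactly one character, so
# conversion is a one-to-one swap. Both ranges below are assumptions.
PUA_START = 0xE000
TARGET_START = 0x11E00      # placeholder for an eventual Unicode block
BLOCK_SIZE = 0x80

PUA_TO_UNICODE = {
    PUA_START + i: TARGET_START + i for i in range(BLOCK_SIZE)
}

def convert(text):
    """Swap PUA characters for their target codepoints; leave the rest alone."""
    return text.translate(PUA_TO_UNICODE)

sample = "abc \ue010\ue04f\ue029"
print([f"U+{ord(c):04X}" for c in convert(sample)])
```

Latin text mixed into the same document passes through untouched, which is one practical advantage over hijacking Latin codepoints.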

b) Use PUA to effect a ‘glyph encoding’, i.e. assign a PUA codepoint to every glyph in the font, and then use a custom keyboard driver (e.g. in the open source Keyman framework, although you can also create installable Windows and Mac OS keyboards) that maps from sequences of letters and formatting characters to conjunct ligatures and other glyph variants.

This approach has the benefit of being fairly robust in terms of not relying on active support for OpenType Layout for PUA characters, and producing documents that will render identically, using the PUA encoded font, in different environments. It also facilitates visual order input, which might be helpful in getting repha to work.

What you end up with in this approach is documents that are the digital equivalent of hand-set metal type: there is a one-to-one mapping from PUA codepoints to each graphical unit of the visible text, without any mapping to characters as they would exist in a clean text encoding. This means that document conversion will require constructing such a mapping, which can be complicated if converting from visual order to phonetic order.
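A rough Python sketch of what such a mapping involves is below. Everything in it is hypothetical: the PUA values, the character names, and the simplified rule that a repha typed after its base (visual order) moves in front of the preceding unit. Real text would need a more careful definition of what counts as a cluster.

```python
# Option (b): each PUA codepoint is a *glyph*, so conversion needs a table
# from glyph to the character sequence it represents. All values are invented.
GLYPH_TO_CHARS = {
    0xF000: ["KA"],
    0xF001: ["KA", "VIRAMA", "YA"],      # precomposed kya conjunct
    0xF002: ["RA", "VIRAMA"],            # repha, typed after its base
}
REPHA = 0xF002

def to_phonetic(text):
    """Expand glyph codepoints to character names, moving repha before its base."""
    out = []
    for ch in text:
        cp = ord(ch)
        chars = GLYPH_TO_CHARS.get(cp)
        if chars is None:
            out.append([ch])                 # not one of our glyphs: keep as-is
        elif cp == REPHA and out:
            out.insert(len(out) - 1, chars)  # visual order -> phonetic order
        else:
            out.append(chars)
    return [c for group in out for c in group]

print(to_phonetic("\uf000\uf002"))   # ['RA', 'VIRAMA', 'KA']
```

Compared with a one-to-one codepoint swap, the table here is one-to-many and order-sensitive, which is the extra cost of the glyph-encoding route.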

Charles · January 2023

Thanks for reading the thread and the advice! It will take me a while to digest and implement.

Sláinte

Charles

John Savard · January 2023

Charles said:

Can I be part of all this? Wait for sponsorship or forge ahead by myself?
My goal was to make a practical system to drop glyphs into a Word doc.

If there is no existing encoding for Tocharian in Unicode, clearly, you may not wish to wait until there is one.

For the time being, you can define a non-Unicode font, in which every glyph has its own codepoint, with no attempt to combine a consonant followed by a vowel into a symbol. That avoids having to work with more advanced features of your font design program for now.

But putting the characters into different fonts is still a mistake. After all, there are more characters in Unicode than just those that represent the Latin alphabet.

Now, you might ask, though, if I assign the additional characters to, say, accented letters, won't it be awkward to type them? I would agree that switching to a French keyboard to type one character, and then switching to a Hungarian keyboard to type the next one would be awkward.

Fortunately, in Windows, there is a program called "Character Map" which you can use to select any character in a font.

A weird thing you could do, if you were desperate, is assign each Tocharian glyph to, say, a Korean glyph, and use the Korean keyboard on your computer! (If the Tocharian syllabary had the same structure as that of, say, Devanagari, you could assign the glyphs that way using an open-source Sanskrit font with a Sanskrit keyboard instead, then the setup for going from multiple keystrokes to one glyph would already be done for you.)

EDIT: I looked up information about the Tocharian script. As it's descended from the Brahmi script, it might seem as if everything is solved, and you could use a Sanskrit font.

Except for something called the "Fremdzeichen".

Instead of putting a vowel mark on consonants for one particular vowel, sometimes an alternate form of the consonant is used (but there is also a vowel mark for that vowel).

But the alternate form of the consonant can also be used with the vowel mark for another vowel, as an alternate way of writing the same syllable!

You could still do this in one font, by switching from a Sanskrit keyboard to a Thai keyboard when you wanted to type Fremdzeichen, or something like that. But maybe it would be simpler to take the same Sanskrit font, and put all the Fremdzeichen into another font which you would label as the "italic" font of the typeface.

That way, texts could be created using only the keyboard and the most common word processing functions. Of course this would horrify Unicode purists, but it would get the job done for the time being, and eventually you could take the glyphs you drew and put them into a proper Unicode Tocharian font once Unicode codepoints are assigned.

Of course, what I'm describing is the quick-and-dirty way of doing this instead of the "right" way, but it appears that this is what may meet your current needs.

Charles · January 2023

Again, thanks John Hudson for the insightful post:

The brāhmī repha in Tocharian is just an “r” of the ligaturen written above the line and provides attachment of the vowel (although I don’t think this holds true for Sanskrit loans…) See attachment 5. No cognitive dissonance like when one reads Sanskrit. Good example of where a simple look-up table eliminates the complicated keystroke sequence in a typical Devanāgarī keyboard.

Charles · January 2023

Hey John S:

If you look at attachment 1 earlier in the thread, there is a snip of a Latin-Unicode table (attachment 4 also). Characters in bold and underlined are transliterations of the Fremdzeichen, e.g. –nt or pä. Just like “a” to a, “ā” to A, and “ä” to @ and the widget mentioned above, “t” could become T and “p” could become P, advocating a look-up table with a normal English-based keyboard and a Devanāgarī keyboard approach.

Thanks for brainstorming about this with me.

Sláinte

Charles

John Hudson · January 2023

The brāhmī repha in Tocharian is just an “r” of the ligaturen written above the line and provides attachment of the vowel (although I don’t think this holds true for Sanskrit loans…) See attachment 5.

The ‘above the line’ aspect may imply a typical Indic reordering behaviour for repha. The decision will be around whether a) conjunct initial R– behaves differently from other conjunct-initial letters in terms of shaping and b) whether that shaping suggests the possibility of rendering the repha using a combining mark, in which case post-base reordering is likely.

Any new Unicode script with Brahmic shaping behaviours is going to be passed to the Universal Shaping Engine, which means it will be subject to a standardised shaping model derived from the Unicode Indic properties assigned to the characters. This will mean that decisions about how conjunct-initial R– is handled will be ultimately made at the font level, expressed in terms of whether the <rphf> Reph Forms OTL feature is used or not. If <rphf> is used, then the output glyph from that feature will be reordered to the end of the cluster automatically by USE. If the font instead treats R– like any other conjunct-initial letter, then it will not be reordered.

Simon Cozens · January 2023

I think John's answers are the most useful balance between correctness and pragmatism. Long-term, to exchange data sensibly, Tocharian needs to be encoded in Unicode. But that's going to take years and is not going to help someone who just wants to put some Tocharian glyphs into a document today.

So, yes:

Use PUA codepoints, and use them in a way that mirrors the codepoint inventory of the Unicode proposal.
Use a Devanagari-style input mechanism, perhaps making your own keyboard using Keyman or similar, which produces those PUA codepoints.
Because the script isn't encoded, you can't expect any help from the Universal Shaping Engine today; any re-orderings etc. will need to be done inside the font, which might be a pain.

If I were doing this, I might try a "clever-stupid" approach; instead of trying to do it like an Indic font (which is what you would do if you had support from the shaping engine), just work out what combinations of glyphs produce a conjunct and turn your current lookup tables into a big set of substitution rules. What I mean is:

You hit the "k" key on your Keyman keyboard and it produces the codepoint E010 (11E10 from the proposal; subtract 0x3E00 from the codepoints in the proposal to re-root them into the private use area.)
In your font, E010 is mapped to a glyph called "ka-tocharian".
Similarly the "j" key gives you E04F, which is mapped to "virama-tocharian".
In the calt feature (or similar) of your font, you have a huge set of rules like: "sub ka-tocharian i-tocharian by ki-tocharian; sub ka-tocharian virama-tocharian ka-tocharian by kka-tocharian; sub ka-tocharian virama-tocharian sa-tocharian by ksa-tocharian; ..."

This way you will get a system which you can use today, and which feels similar to the "ideal" system you will get once Tocharian is encoded in Unicode. Once that happens, you can update the codepoint mappings in Keyman and in your font, rewrite the rules to make more use of the help provided by the USE, and you've got a Unicode-compliant Tocharian font.
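Since the question of Python came up earlier in the thread: turning a lookup table into that big set of rules is the sort of thing a short script can do. Here is a minimal sketch, assuming a hypothetical table that maps each conjunct glyph to the sequence of input glyphs that should produce it. The table contents are examples only, and the feature code it prints would still need to be pasted into the font editor's OpenType panel and compiled there.

```python
# Sketch: re-root proposal codepoints into the PUA and generate feature-file
# rules from a look-up table. The table contents are hypothetical examples.
PUA_OFFSET = 0x3E00

def to_pua(proposal_codepoint):
    """Map a codepoint from the proposal's range into the Private Use Area."""
    pua = proposal_codepoint - PUA_OFFSET
    if not 0xE000 <= pua <= 0xF8FF:
        raise ValueError(f"U+{proposal_codepoint:04X} does not land in the PUA")
    return pua

# conjunct glyph -> sequence of input glyphs that should produce it
LOOKUP = {
    "ki-tocharian":  ["ka-tocharian", "i-tocharian"],
    "kka-tocharian": ["ka-tocharian", "virama-tocharian", "ka-tocharian"],
    "ksa-tocharian": ["ka-tocharian", "virama-tocharian", "sa-tocharian"],
}

def feature_code(lookup, tag="calt"):
    """Emit substitution rules, longest input sequences first."""
    rules = sorted(lookup.items(), key=lambda item: -len(item[1]))
    lines = [f"feature {tag} {{"]
    for result, sequence in rules:
        lines.append(f"    sub {' '.join(sequence)} by {result};")
    lines.append(f"}} {tag};")
    return "\n".join(lines)

print(f"{to_pua(0x11E10):04X}")   # E010
print(feature_code(LOOKUP))
```

Putting longer sequences first is a cautious choice so that a short rule cannot pre-empt a longer one if rules are tried in order. With a few thousand conjuncts the table itself could be read from a spreadsheet export rather than typed into the script.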

Charles · January 2023

Thanks for the input.
