Mapping non-unicode fonts

KP Mawhood
KP Mawhood Posts: 296
edited May 2016 in Technique and Theory
Our authors sometimes use non-unicode fonts. Example texts include Greek, Ancient Syriac, Old Slavonic, Coptic, transliterations, etc. The unicode conversion process has been relatively straightforward, and our authors have approved the output. In the worst case, though, we do keep the source encoding.

There's a non-unicode Bengali font in one of our latest products, which fortunately has some online converters. The converted text matches the appearance of the author's PDF.

For example: 
AviwkbMi > আরশিনগর

It would be good to find a way to automate this within an MS Word manuscript, particularly with more sophisticated scripts. Would anyone have any advice? Thanks.

Comments

  • Hin-Tak Leung
    Hin-Tak Leung Posts: 363
    I imagine that it would be possible to automate, since it is possible to automate creation/modification of MS Word documents with markup/layouts. The usual tools are the older COM, VBA, etc. The questions are: (1) is the volume of docs that need converting going to be worth the effort? (2) are they similar enough to make it more straightforward, or are they random, hacked-up, semi-broken docs that all need to be converted?

    Could you go into these two in more detail?
  • A few years ago tools like this were quite common when many applications were still in transition to Unicode and they are probably still around.

    There is probably a way of using VB in Word to do the conversion. The biggest effort will probably be creating the conversion table from the font's ASCII-based encoding to Unicode. Since an online converter already exists, its authors may be willing to let you have their table to incorporate into a VB macro.
  • KP Mawhood
    KP Mawhood Posts: 296
    Thanks for these insights! This is extremely helpful.

    @Hin-Tak Leung
    (1) Is the volume of docs that need converting going to be worth the effort? 
    Unlikely, but we evangelize unicode compliance and it's a useful learning mechanism.

    (2) Are they similar enough to make it more straightforward, or random hacked-up semi-broken docs that all need to be converted?
    Each author submits a Word document and a PDF. Typically, the majority of the text is English and the fonts within tend to be largely unicode compliant. Non-unicode fonts are used for specific non-English clauses (e.g. Coptic, Greek, Bengali etc).

    The "brokenness" of these docs varies between authors. From my limited understanding, they mostly seem good. But, the embedded font inventory of some PDFs includes hundreds of duplicate fonts. Would this cause a problem?

    The unicode mapping varies from font-to-font, with little if any consistency. Currently I'm mapping glyph-to-character by eye – which risks human error – or using Word plugins where I can find them. For the actual conversion we outsource to some extent.

    A friend mentioned that you could use a Python library for MS Word. Would this also work, or is it better to use VBA or COM?
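On the by-eye mapping mentioned above: a minimal sketch, assuming python-docx, of listing which characters each font actually uses in a Word file, so that a conversion table only needs entries for those characters. `chars_by_font` is a hypothetical helper, and `SutonnyMJ` is only an assumed example of a legacy font name. The function reads nothing but `.paragraphs`, `.runs`, `.font.name` and `.text`, so the demo uses stand-in objects instead of a real file.

```python
from collections import defaultdict
from types import SimpleNamespace


def chars_by_font(document):
    """Return {font name: sorted string of the characters used in that font}."""
    inventory = defaultdict(set)
    for paragraph in document.paragraphs:
        for run in paragraph.runs:
            # run.font.name is None when the font is inherited from a style.
            inventory[run.font.name].update(run.text)
    return {font: "".join(sorted(chars)) for font, chars in inventory.items()}


def fake_run(text, font_name):
    """Stand-in for a python-docx run (demo only)."""
    return SimpleNamespace(text=text, font=SimpleNamespace(name=font_name))


# Stand-in document: one paragraph mixing an English run and a legacy-font run.
fake_doc = SimpleNamespace(paragraphs=[SimpleNamespace(runs=[
    fake_run("The title is ", "Times New Roman"),
    fake_run("AviwkbMi", "SutonnyMJ"),  # font name is an assumption
])])

inventory = chars_by_font(fake_doc)
print(inventory["SutonnyMJ"])
```

With python-docx installed, the same function should accept `docx.Document("chapter.docx")` (file name hypothetical). Note that `document.paragraphs` covers body text only, so text in tables, headers or footnotes would need a separate pass.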
  • KP Mawhood
    KP Mawhood Posts: 296
    Apart from online conversion tools, or Word plugins, what type of tools do you mean? Can you give any examples?

    You're right, the conversion table is quite time consuming. I had not even thought to contact the creators of the online converter. That's a great idea! We've been in touch for plugins before, and everyone's been very helpful. 
  • The python-docx library should do the job: https://python-docx.readthedocs.io/en/latest/
    You can easily loop through the paragraphs of a document, converting the runs of paragraphs that are using the Bijoy fonts you’ve identified.

    For the conversion, some people have already done most of the work, which you can find on GitHub: https://github.com/search?utf8=✓&q=bijoy, for example https://github.com/bahar/BijoyToUnicode/blob/master/bijoy2unicode.php.
    It just needs to be translated to Python.
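A rough sketch of the loop described above, under stated assumptions: `convert_runs` and `convert_file` are hypothetical helpers, the font names are examples only, and `convert` stands for whatever legacy-to-Unicode function you end up with (the demo passes `str.upper` as an obvious placeholder). python-docx is imported only inside `convert_file`, so the demo runs on stand-in objects without the library installed.

```python
from types import SimpleNamespace


def convert_runs(document, legacy_fonts, convert, unicode_font):
    """Convert the text of every run set in a legacy font; return the count."""
    changed = 0
    for paragraph in document.paragraphs:
        for run in paragraph.runs:
            if run.font.name in legacy_fonts:
                run.text = convert(run.text)
                run.font.name = unicode_font
                changed += 1
    return changed


def convert_file(src, dst, legacy_fonts, convert, unicode_font):
    """Same as convert_runs, but on a real .docx file (needs python-docx)."""
    from docx import Document  # imported lazily; third-party package
    document = Document(src)
    changed = convert_runs(document, legacy_fonts, convert, unicode_font)
    document.save(dst)
    return changed


def fake_run(text, font_name):
    """Stand-in for a python-docx run (demo only)."""
    return SimpleNamespace(text=text, font=SimpleNamespace(name=font_name))


fake_doc = SimpleNamespace(paragraphs=[SimpleNamespace(runs=[
    fake_run("The title is ", "Times New Roman"),
    fake_run("AviwkbMi", "SutonnyMJ"),  # assumed legacy font name
])])

# str.upper is a placeholder for a real conversion function.
changed = convert_runs(fake_doc, {"SutonnyMJ"}, str.upper, "Noto Sans Bengali")
print(changed, fake_doc.paragraphs[0].runs[1].text)
```

Two caveats worth checking on real files: runs that inherit their font from a style report `font.name` as `None`, and setting `font.name` may not cover the complex-script font slot that Word uses for Bengali, so the result should be verified in Word itself.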
  • Hin-Tak Leung
    Hin-Tak Leung Posts: 363
    edited May 2016
    I am somewhat more familiar with pdf than MS word, but I imagine the possibilities are the same.


    > The "brokenness" of these docs varies between authors. From my limited understanding, they mostly seem good. But, the embedded font inventory of some PDFs includes hundreds of duplicate fonts. Would this cause a problem?
    >
    > The unicode mapping varies from font-to-font, with little if any consistency. Currently I'm mapping glyph-to-character by eye – which risks human error – or using Word plugins where I can find them. For the actual conversion we outsource to some extent.


    Not sure what you mean by that - if you can automate, whatever number of fonts you use is just a number. It might be easier with PDFs with embedded fonts, as they will be subsetted (and therefore only contain used glyphs), and you can scan fonts within PDFs for PUA usage. The mapping from the PUA to unicode should be available from the font vendor? Or it should be consistent per font vendor. You shouldn't need to guess?
  • > Non-unicode fonts are used for specific non-English clauses (e.g. Coptic, Greek, Bengali etc).

    What’s the problem in using Unicode text encoding for scripts like Greek or Coptic?

    I’d guess at OUP there are some guidelines provided for authors who submit text for editing and composing.

  • Malcolm Wooden
    Malcolm Wooden Posts: 58
    edited May 2016
    We have used EncodingMaster.

    This program comes with a big list of encodings, plus it lets you use your own custom tables for anything unusual. It used to cost but is now free.

    You may be able to use the encoding table from the Bengali font online converter in this program with a bit of manipulation.
  • KP Mawhood
    KP Mawhood Posts: 296
    @Denis Moyogo Jacquerye Thank you for this guidance. We don't have anyone in house to help – and I am no programmer – but this is slowly starting to make sense. Package is installed, I'm working through the quick start.

    @Hin-Tak Leung It is not PUA usage, but ASCII attached to glyphs that do not match, e.g. AviwkbMi > আরশিনগর. The Adobe Acrobat GUI can create an inventory of fonts in any PDF, which helps when the files are good. But, when the PDF contains hundreds of duplicate fonts, it becomes very difficult to analyze effectively. I'm starting to move between GUI and bash, but I'm not there yet. How would you analyze PDFs for characters-by-font? Is it possible to combine duplicate fonts?

    @Andreas Stötzner Our authors do not always read our guidelines, or do not understand what unicode means. In one instance, an author has threatened to withdraw his chapter should we demur on use of "his" font. He believes that the font is unicode compliant, when it is not, and he does not own the font copyright.

    @Malcolm Wooden Thanks for the link, I'll need to tie this together with the python-docx library to use MS docs. I get quite varied feedback on our XML, but I am investigating this also – which would be easier overall. Looks very good so far. Thanks.
  • John Hudson
    John Hudson Posts: 3,227
    edited May 2016
    For three of the Murty texts to date we've had to convert from a non-standard font encoding to Unicode text. One was pretty simple, but the two Devanagari ones were tricky. Substitutions have to be applied in a carefully defined sequence, and since the documents used visual order we had to reorder repha and ikar based on context. I defined the algorithm and Karsten Lueke programmed the script.
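To illustrate the two steps described above (ordered substitution, then visual-to-logical reordering), here is a minimal sketch using the Bengali example from the top of the thread rather than the Devanagari texts. It is not the script used for the Murty texts. The table is tiny and partly assumed: most entries are read off the `AviwkbMi > আরশিনগর` example, and the standalone `A` and `v` entries are assumptions added to show why longer sequences must be matched first. Conjuncts, repha and other contextual cases are not handled.

```python
import re

# Tiny illustrative table; NOT a complete or authoritative font mapping.
TABLE = {
    "Av": "\u0986",  # আ (must win over "A" followed by "v")
    "A": "\u0985",   # অ (assumed)
    "v": "\u09BE",   # া (assumed)
    "i": "\u09B0",   # র
    "w": "\u09BF",   # ি (typed before the consonant in visual order)
    "k": "\u09B6",   # শ
    "b": "\u09A8",   # ন
    "M": "\u0997",   # গ
}

# Longest keys first, so "Av" is tried before "A"; unmapped text passes through.
_PATTERN = re.compile(
    "|".join(re.escape(key) for key in sorted(TABLE, key=len, reverse=True))
)

# Vowel signs drawn to the left of the consonant, followed by a consonant.
_PREBASE = re.compile("([\u09BF\u09C7\u09C8])([\u0995-\u09B9])")


def substitute(text):
    """Step 1: replace legacy sequences with Unicode characters."""
    return _PATTERN.sub(lambda match: TABLE[match.group(0)], text)


def reorder_prebase(text):
    """Step 2: move a pre-base vowel sign after the consonant it belongs to."""
    return _PREBASE.sub(r"\2\1", text)


def to_unicode(text):
    return reorder_prebase(substitute(text))


print(to_unicode("AviwkbMi"))
```

The order of the two steps matters: reordering works on Unicode characters, so it can only run once the substitutions have been made.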
  • Hin-Tak Leung
    Hin-Tak Leung Posts: 363
    edited May 2016
    @Katy Mawhood : you need something that has some understanding of what 'text' is associated with what font, and changes the 'text' and font accordingly. I'd say you need ghostscript :-), and look at the font substitution APIs within.

    If you have the documents in the xml form, it might be just a lot of text parsing and substitution.
  • KP Mawhood
    KP Mawhood Posts: 296
    @Hin-Tak Leung Thanks, time for me to read and learn. :smile: