Our authors sometimes use non-Unicode fonts. Example texts include Greek, Ancient Syriac, Old Slavonic, Coptic, transliterations, etc. The Unicode conversion process has been relatively straightforward, and our authors have approved the output. In the worst case, though, we keep the source encoding.
There's a non-Unicode Bengali font in one of our latest products, which fortunately has some online converters. The converted text matches the appearance of the author's PDF.
For example:
AviwkbMi > আরশিনগর
It would be good to find a way to automate this within an MS Word manuscript, particularly with more sophisticated scripts. Would anyone have any advice? Thanks.
Comments
If you could go into these two in more detail.
There is probably a way of using VBA in Word to do the conversion. The biggest effort will probably be creating the conversion table from the font's ASCII encoding to Unicode (UTF-8). Since an online converter already exists, its authors may be willing to let you have their table to incorporate into a VBA macro.
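To make the "conversion table" idea concrete, here is a minimal sketch in Python rather than VBA. The table is only a tiny fragment that I inferred from the single example in the question (AviwkbMi > আরশিনগর); it is an assumption, not a vendor table, and a real one would have to come from the font vendor or an existing converter. The names `TABLE`, `PRE_BASE_VOWELS` and `convert` are made up for illustration. The reordering step is deliberately naive: the legacy font types the vowel sign before the consonant, while Unicode stores it after, and conjuncts would need more care than this.

```python
# Illustrative fragment only, inferred from "AviwkbMi" in the question.
TABLE = {
    "Av": "\u0986",  # আ
    "A": "\u0985",   # অ
    "i": "\u09B0",   # র
    "w": "\u09BF",   # ি (vowel sign, typed before the consonant in the legacy font)
    "k": "\u09B6",   # শ
    "b": "\u09A8",   # ন
    "M": "\u0997",   # গ
}

# Vowel signs that the legacy encoding places before the consonant.
PRE_BASE_VOWELS = {"\u09BF"}


def convert(text, table=TABLE):
    """Convert legacy-encoded text using longest-match-first table lookup."""
    keys = sorted(table, key=len, reverse=True)  # try "Av" before "A"
    out = []
    i = 0
    while i < len(text):
        for key in keys:
            if text.startswith(key, i):
                out.append(table[key])
                i += len(key)
                break
        else:
            out.append(text[i])  # unknown character: pass through unchanged
            i += 1

    # Naive reordering: move a pre-base vowel sign after the next character.
    j = 0
    while j < len(out) - 1:
        if out[j] in PRE_BASE_VOWELS:
            out[j], out[j + 1] = out[j + 1], out[j]
            j += 2
        else:
            j += 1
    return "".join(out)


print(convert("AviwkbMi"))
```

Passing unknown characters through unchanged makes gaps in the table easy to spot in the output, which seems safer than dropping them silently.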
@Hin-Tak Leung
(1) Is the volume of docs that need converting going to be worth the effort?
Unlikely, but we evangelize unicode compliance and it's a useful learning mechanism.
(2) Are they similar enough to make it more straightforward, or are they random, hacked-up, semi-broken docs that all need to be converted?
The authors submit a Word document and a PDF. Typically, the majority of the text is English and the fonts within tend to be largely Unicode compliant. Non-Unicode fonts are used for specific non-English clauses (e.g. Coptic, Greek, Bengali).
The "brokenness" of these docs varies between authors. From my limited understanding, they mostly seem good. But, the embedded font inventory of some PDFs includes hundreds of duplicate fonts. Would this cause a problem?
The Unicode mapping varies from font to font, with little if any consistency. Currently I'm mapping glyph-to-character by eye – which risks human error – or using Word plugins where I can find them. For the actual conversion we outsource to some extent.
A friend mentioned that you could use an MS Word Python library. Would this also work, or is it better to use VBA or COM?
You can easily loop through the paragraphs of a document, converting the runs of paragraphs that are using the Bijoy fonts you’ve identified.
For the conversion, some people have already done most of the work, which you can find on GitHub: https://github.com/search?utf8=✓&q=bijoy (for example, https://github.com/bahar/BijoyToUnicode/blob/master/bijoy2unicode.php).
It just needs to be translated to Python.
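A rough sketch of that paragraph/run loop with python-docx might look like the following. The function names, the font name "SutonnyMJ" and the `convert` callable are assumptions for illustration; `convert` stands in for whatever Python translation of the table you end up with. Two caveats I'm aware of: `run.font.name` is `None` when the font is inherited from a style, and `document.paragraphs` does not include text inside tables, so both cases would need extra handling.

```python
def convert_legacy_runs(document, legacy_fonts, convert, unicode_font=None):
    """Convert the text of every run set in one of `legacy_fonts`.

    `document` is expected to behave like a python-docx Document.
    Returns the number of runs that were changed.
    """
    changed = 0
    for paragraph in document.paragraphs:
        for run in paragraph.runs:
            if run.font.name in legacy_fonts and run.text:
                run.text = convert(run.text)
                if unicode_font is not None:
                    # May not cover Word's complex-script font slot.
                    run.font.name = unicode_font
                changed += 1
    return changed


def convert_file(src, dst, legacy_fonts, convert, unicode_font=None):
    """Open a .docx, convert the legacy runs, and save a copy."""
    from docx import Document  # python-docx, imported only when needed

    document = Document(src)
    count = convert_legacy_runs(document, legacy_fonts, convert, unicode_font)
    document.save(dst)
    return count


# Hypothetical usage, assuming a convert() function exists:
# convert_file("chapter.docx", "chapter-unicode.docx", {"SutonnyMJ"}, convert)
```

Saving to a new file rather than overwriting keeps the author's original available for comparison against the PDF.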
Not sure what you mean by that. If you can automate, whatever number of fonts you use is just a number. It might be easier with PDFs with embedded fonts, as they will be subsetted (and therefore only contain used glyphs), and you can scan fonts within PDFs for PUA usage. The mapping from the PUA to Unicode should be available from the font vendor? Or should be consistent per font vendor. Shouldn't need to guess?
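For the PUA scan, a small sketch: once text has been extracted by whatever tool you use (the extraction step is not shown and is an assumption), Python's `unicodedata` module reports Private Use code points with the general category "Co". The function name is made up for illustration.

```python
import unicodedata


def private_use_chars(text):
    """Return the distinct Private Use Area characters found in `text`."""
    return sorted({ch for ch in text if unicodedata.category(ch) == "Co"})


# Plain ASCII text, like the legacy-encoded example, contains none:
print(private_use_chars("AviwkbMi"))
```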
What’s the problem in using Unicode text encoding for scripts like Greek or Coptic?
I’d guess at OUP there are some guidelines provided for authors who submit text for editing and composing.
This program comes with a big list of encodings, plus it lets you use your own custom tables for anything unusual. It used to cost money but is now free.
You may be able to use the encoding table from the Bengali font online converter in this program with a bit of manipulation.
@Hin-Tak Leung It is not PUA usage, but ASCII attached to glyphs that do not match, e.g. AviwkbMi > আরশিনগর. The Adobe Acrobat GUI can create an inventory of fonts in any PDF, which helps when the files are good. But, when the PDF contains hundreds of duplicate fonts, it becomes very difficult to analyze effectively. I'm starting to move between GUI and bash, but I'm not there yet. How would you analyze PDFs for characters-by-font? Is it possible to combine duplicate fonts?
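On characters-by-font and duplicate fonts, one possible approach, sketched under assumptions: subsetted fonts in a PDF are normally named with a six-uppercase-letter tag and a plus sign in front of the real name (for example `ABCDEF+FontName`), so stripping that tag lets you group the duplicates. The sketch assumes some extraction tool has already produced (font name, text) pairs; that step and the function names are hypothetical. Merging by name also assumes that same-named subsets share one encoding, which is worth spot-checking.

```python
import re
from collections import Counter, defaultdict

# Subset tag: six uppercase letters followed by "+".
SUBSET_PREFIX = re.compile(r"^[A-Z]{6}\+")


def base_font_name(name):
    """Strip the subset tag so duplicate subsets share one name."""
    return SUBSET_PREFIX.sub("", name)


def characters_by_font(spans):
    """Tally non-space characters per base font from (font, text) pairs."""
    inventory = defaultdict(Counter)
    for font_name, text in spans:
        chars = (ch for ch in text if not ch.isspace())
        inventory[base_font_name(font_name)].update(chars)
    return inventory


# Hypothetical extracted spans; the font names are made up.
spans = [
    ("ABCDEF+SutonnyMJ", "Avi"),
    ("GHIJKL+SutonnyMJ", "iw"),
    ("Times", "a b"),
]
for font, counts in characters_by_font(spans).items():
    print(font, dict(counts))
```

The per-font tallies give you the list of codes that actually occur, which is the part of each mapping table you need to get right first.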
@Andreas Stötzner Our authors do not always read our guidelines, or do not understand what unicode means. In one instance, an author has threatened to withdraw his chapter should we demur on use of "his" font. He believes that the font is unicode compliant, when it is not, and he does not own the font copyright.
@Malcolm Wooden Thanks for the link, I'll need to tie this together with the python-docx library to handle MS Word docs. I get quite varied feedback on our XML, but I am investigating this also – which would be easier overall. Looks very good so far. Thanks.
If you have the documents in the xml form, it might be just a lot of text parsing and substitution.
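As a sketch of what that parsing and substitution could look like, assuming the XML is WordprocessingML (the `word/document.xml` part inside a .docx) and that the legacy font is set directly on the run via `w:rFonts`. Your own XML may use a different schema, and fonts applied through styles would not be caught. The function name, the font name "LegacyFont" and the `convert` callable are placeholders.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}
ET.register_namespace("w", W)  # keep the "w:" prefix when serialising


def convert_xml(xml_text, legacy_font, convert):
    """Convert the text of runs whose w:rFonts names `legacy_font`."""
    root = ET.fromstring(xml_text)
    for run in root.iter(f"{{{W}}}r"):
        fonts = run.find("w:rPr/w:rFonts", NS)
        if fonts is None or fonts.get(f"{{{W}}}ascii") != legacy_font:
            continue  # leave runs in other fonts untouched
        for t in run.findall("w:t", NS):
            if t.text:
                t.text = convert(t.text)
    return ET.tostring(root, encoding="unicode")


sample = (
    f'<w:document xmlns:w="{W}"><w:body><w:p>'
    '<w:r><w:rPr><w:rFonts w:ascii="LegacyFont"/></w:rPr><w:t>abc</w:t></w:r>'
    '<w:r><w:t>keep</w:t></w:r>'
    '</w:p></w:body></w:document>'
)
# str.upper stands in for a real legacy-to-Unicode converter.
print(convert_xml(sample, "LegacyFont", str.upper))
```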