Text Samples for Each Script?

Been struggling to assemble a set of test / specimen text samples that cover the set of Unicode scripts. Might be a useful resource for font developers.

I'm not looking for an exhaustive character set (which could be auto-generated) but for something more user-friendly (for a specimen book) that can also serve as a lightweight initial test of a font with extended script coverage. Yes, there are issues of which language to use for a given script, and regional variations, but for a lightweight test, a set of available texts might be useful for font work.

Some options I've looked at:
  • Article 1 of the UDHR (Universal Declaration of Human Rights - "All human beings are born free ...") available at https://github.com/unicode-org/udhr. Good length. However, the 500-ish translations cover only 43 of the 150 Unicode scripts (I'm still on Unicode v12.1).
  • Genesis 11:1 ("Now the whole world had one language and a common speech"). A bit short. I've collected text for about 76 scripts (only about 50%), and many of the "under-served" scripts have only images available, not Unicode text.
  • Pangrams. Would need development for less-used scripts, and that is daunting.
  • Representative Characters. A small set of characters that demonstrate the typographic attributes of that script. This might be useful, but is typically very short, does not represent body text, and does not give the 'feel' of the script from a user perspective.
  • Character Strings. For scripts where I do not have any of the above, I have been falling back on a character string of the first hundred or so assigned characters, excluding combining diacritics and other oddballs, with some random spaces thrown in to approximate body text. Pretty poor substitute for body text, but that's all I've come up with ...
Any thoughts or suggestions in this area would be appreciated!
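
For anyone who wants to try the "Character Strings" fallback above, here is a minimal sketch in Python. It is not the script I actually use. The function name, the parameters and the Cherokee block range in the usage note are assumptions for illustration. Python's standard `unicodedata` module has no script property, so a code point range stands in for a script here, and what counts as "assigned" depends on the Unicode version shipped with your Python.

```python
import random
import unicodedata


def fallback_sample(first, last, count=100, mean_word_len=5, seed=0):
    """Build a pseudo body-text string from a code point range (hypothetical helper)."""
    rng = random.Random(seed)
    chars = []
    for cp in range(first, last + 1):
        ch = chr(cp)
        category = unicodedata.category(ch)
        # Skip unassigned, control and format characters (C*),
        # combining marks (M*) and separators (Z*).
        if category[0] in "CMZ":
            continue
        chars.append(ch)
        if len(chars) == count:
            break
    out = []
    for i, ch in enumerate(chars):
        out.append(ch)
        # After roughly one character in mean_word_len, insert a space
        # (never after the last character) to approximate word breaks.
        if i < len(chars) - 1 and rng.random() < 1 / mean_word_len:
            out.append(" ")
    return "".join(out)
```

For example, `fallback_sample(0x13A0, 0x13FF)` walks the Cherokee block. The fixed seed only makes the spacing repeatable between runs.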

Comments

  • ClintGoss (Posts: 66)
    I also considered using the Name Table / Sample Text resources in existing fonts, but only 40 of 1768 open-source fonts that I surveyed populate that field, and most of those are English.
  • Pangrams would still need a lot of development, but this website could help you get a head start — they've got at least 59 languages plus variations with only diacritic marks, etc. I use it pretty frequently.
  • ClintGoss (Posts: 66)
    Thanks Noah ... Clagnut is a great site, but still provides only a fraction of the Unicode scripts (maybe 20-25 ... I have not logged them specifically). But they do provide Klingon ... Kplach!
  • wikipedia is a great source of course, because it’s available in just about any language/script. simply find out which lemma is covered by every language/script :-)

    a few suggestions: ‘woman’ ‘man’ ‘earth’ ‘wind’ ‘fire’ ‘water’ ‘language’ ‘history’ ‘constitution’
  • I use wikipedia very often for sample text. The ones I've used are for roses, deer, libraries, salt etc. whatever you're interested in really. General topics of course will have more languages supported.
    Also, in chrome settings you can set your default sans font to the font you are working on. Then you can just press "random article" for new material :0
    To transfer a wikipedia article to a word document, I copy and paste it into TextEdit, then use shift+command+T twice to remove all hyperlinks quickly. Then I use find and replace to search for '[' and remove the bracketed footnote markers. Then I fix the formatting of paragraphs if necessary.

    Here's the one I'm currently using as a pdf (I guess copy paste out of it? I can't post pages documents on typedrawers). 
    I also have a lot of sample texts for spacing special characters like ŋ and ѭ, which I can post as a pdf or put in a google doc or something if you want lol
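
The find-and-replace step for footnote markers described in this comment could also be scripted. The following is a minimal sketch, not the commenter's workflow: the function name and the exact pattern are assumptions, and the pattern only targets numeric or single-letter markers plus the "[citation needed]" tag.

```python
import re

# Matches bracketed footnote markers such as [1], [23] or [a], plus the
# "[citation needed]" tag. Other bracketed text is left alone.
# Note: in Python, \d also matches non-ASCII decimal digits.
_MARKER = re.compile(r"\[(?:\d+|[a-z]|citation needed)\]")


def strip_footnote_markers(text):
    """Remove Wikipedia-style footnote markers from pasted text (hypothetical helper)."""
    return _MARKER.sub("", text)
```

Pasted article text can be passed through `strip_footnote_markers` before it goes into a specimen document. Paragraph formatting still needs a manual check.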
  • I've used Markov Chains to generate filler text for languages where I had a limited dictionary or no exhaustive text samples to pick from — you do need some text though. It will not create grammatically correct text, but for judging charset support, text color and appearance of common word/letter combinations this approach could work to fill in scripts/languages for which you have little to go on otherwise.

    For example, using only the first paragraph of text from Wikipedia's page on "Earth" as a base distribution, you get (endless) generated gems like these (illustrating with English, so you can judge the viability of this approach):
    Earth's only natural resources for their survival.
    Over 99%of all species on earth rotates about 29%of anaerobic and the number of earth today vary widely;most species on earth's surface is tilted with other fresh water, life may have arisen as an earth, earth's gravity interacts with other objects in the densest planet from the arctic ice pack.
    Over many millions of the densest planet from the densest planet from the four rocky planets.
    Over many millions of the sun and the sun and the moon causes tides, earth orbits around the third planet in space, which all species on earth's history have arisen as early as early as early as 4.
    The proliferation of the first billion years.
    7 billion years ago.
    256 sidereal year has gone through long periods of the surface is the combination of earth today vary widely;most massive of earth's interior remains active with respect to affect earth's only natural satellite.
    5 billion years ago.
    Since then, a period known to harbor life on earth's only astronomical object known to its axis, aerobic organisms.
    Earth and other fresh water, earth is land consisting of the sun in the majority of rotation is land consisting of expansion, later, occasionally punctuated by oceans but also lakes, that generates earth's polar regions are covered in the sun and the moon, later, rivers and thrive.
    The world has 366.
    Earth's surface over 99%of species that drives plate tectonics.
    The third planet from the proliferation of the majority of continents and the densest planet in ice pack.
    Since then, the remaining 71%is land consisting of earth is, physical properties and the remaining 71%of the only astronomical object known as 4.
    According to radiometric dating and depend on earth, life may have arisen as an earth.
    Earth sidereal year has around the surface, a convecting mantle that drives plate tectonics.
    Maybe too messy, but one possible idea :)
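
A word-level Markov chain of the kind described in this comment can be sketched in a few lines of Python. This is an illustration under assumptions, not the commenter's actual code: the function names are hypothetical, words are split on whitespace only, and the chain looks back a single word.

```python
import random
from collections import defaultdict


def build_chain(text):
    """Map each word to the list of words that follow it in the source text."""
    words = text.split()
    chain = defaultdict(list)
    for current, following in zip(words, words[1:]):
        chain[current].append(following)
    return chain


def generate(chain, length=30, seed=None):
    """Random-walk the chain to produce `length` words of filler text."""
    if not chain:
        raise ValueError("need at least two words of source text")
    rng = random.Random(seed)
    starts = sorted(chain)  # sorted so a fixed seed gives repeatable output
    word = rng.choice(starts)
    out = [word]
    while len(out) < length:
        followers = chain.get(word)
        # A word seen only at the very end of the source has no followers;
        # in that case restart from a random word.
        word = rng.choice(followers) if followers else rng.choice(starts)
        out.append(word)
    return " ".join(out)
```

Because followers are stored with repetition, frequent word pairs in the source are proportionally more likely in the output. Splitting on whitespace will not work for scripts written without spaces between words. Those would need a different tokenizer.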
  • Wow, poetry in its purest form.
  • Of course the Universal Declaration of Human Rights is often used. I use it as a test case for language identification (~60 languages). Some versions are only available as images. Some are in the wrong language: AFAIR the Yiddish version is in the Hebrew language, and the Bosnian one is Serbian.

    You can use word lists to create random (senseless) texts using some probability method, taking care that nearly all characters and the most frequent letter bigrams appear. Something like this is used to generate training data for OCR systems. Depending on the purpose, you may also need punctuation and numbers.

    http://crubadan.org/ has wordlists with frequencies for 2,228 languages, with a maximum of 50,000 different words per language. But they cover only living languages; e.g. Latin and ancient Greek are missing. Generally it will be hard to get texts for ancient scripts like Cuneiform.
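
The word-list approach described in this comment might look roughly like the sketch below. It assumes the word list has already been loaded into a plain dictionary mapping words to frequencies. The function names are hypothetical and nothing here reflects the actual file format of any particular word-list source.

```python
import random
from collections import Counter


def random_text(freqs, n_words=50, seed=None):
    """Draw words at random, weighted by their frequency in the word list."""
    rng = random.Random(seed)
    words = list(freqs)
    weights = [freqs[w] for w in words]
    return " ".join(rng.choices(words, weights=weights, k=n_words))


def top_bigrams(freqs, n=20):
    """Return the n most frequent letter bigrams, weighted by word frequency."""
    counts = Counter()
    for word, freq in freqs.items():
        for a, b in zip(word, word[1:]):
            counts[a + b] += freq
    return [bigram for bigram, _ in counts.most_common(n)]


def missing_chars(text, freqs):
    """Characters used somewhere in the word list but absent from the text."""
    return set("".join(freqs)) - set(text)


def missing_bigrams(text, freqs, n=20):
    """Top-n letter bigrams of the word list that never occur in the text."""
    return [bigram for bigram in top_bigrams(freqs, n) if bigram not in text]
```

One way to use this is to generate a text, check `missing_chars` and `missing_bigrams`, and regenerate or lengthen the text until both are empty or acceptably small.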

    Another text available in many languages is the Lord's Prayer. 

    http://www.krassotkin.ru/sites/prayer.su/other/all-languages.html has ~370 versions.

    https://wikisource.org/wiki/The_Lord%27s_Prayer has 108 versions.

     

  • John Hudson (Posts: 3,229)
    I use UDHR for short specimen strings (usually Article 18, which is a reasonable length), but it is too repetitious and oddly structured to be useful for longer specimens.


  • For Han/Kana/Hangul, I use this for testing; it can reveal character set support and the orthography of ideographs.
    12345678-12345678-12345678-12345678-12345678-12345678
    花鳥風月 春夏秋冬 生老病死 喜怒哀樂 櫻梅桃李 起承轉合  (zh-Hant)
    花鸟风月 春夏秋冬 生老病死 喜怒哀乐 樱梅桃李 起承转合  (zh-Hans)
    花鳥風月 春夏秋冬 生老病死 喜怒哀楽 桜梅桃李 起承転結  (ja)
    いろはにほへど  ちりぬるを  わがよたれぞ  つねならむ  (ja)
    ウヰノオクヤマ  ケフコエテ  アサキユメミジ ヱヒモセズ  (ja)
    챠트 피면 술컵도 유효작                  (ko)
  • RichardW (Posts: 100)
    Just having a text in Unicode is not good enough.  Some of the encodings/spellings collected by Unicode for the UDHR are atrocious, arguably an indictment of the Unicode Standard.  What you need is images plus the Unicode encoding.  It would be reassuring to see that someone has managed to get something similar out of a Unicode font, but beware that such fonts may have limitations.  You may also need a wide collection of writing styles - the Emacs developers have just been unnecessarily worried by font-to-font variations in the gross placement of tashdid and kasra relative to the letter in the Arabic script, and I'm not talking about fancy calligraphy.

    I don't disagree that this could be a significant project for a complex script - just tabulating the typographically significant combinations for Tai Tham was difficult (and significantly incomplete because I missed some Lao combinations), and I now have qualms about the copyright of the sampler when it's finished.  I'm working in the UK, so taking and sharing samples from books teaching one how to read and write the script is a bit iffy both legally and morally.