African languages word lists
Does anyone have a lead on word collections for some of the more widely spoken African languages, e.g. Hausa, Ewe, Kabiyè, Fulfulde/Pulaar/Pular, Serer, etc.?
Word frequency lists would be ideal. Dictionary collections could work. Bare corpora would have to be massaged significantly, so something already parsed would be preferable.
I’m particularly looking for extensive examples of some of the letters with more unusual perimeters, like protruding hooks: ƴɗƈɲɖ more than, say, ɓƙŋ.
[pinging @Denis Moyogo Jacquerye 😊]
Comments
-
Denis is indeed your person. He has assembled/updated many resources in this area for Google in the last couple of years. @Eben Sorkin has been working more with corpora but may be able to point to word lists or dictionaries as well.
-
Indeed, check Eben's octo-text repo (in the "Output/7 African Language Textures" folder) for some text to work with. I don't have full context on what Octopus is (maybe internal to Google?), but the data in the repo is helpful nonetheless.
-
Thanks for the pointer, guys. Eben’s Octo-text looks like it has some good stuff in it. I’ve never heard of the Octopus software, but I’ll dig in and see what I can make use of for my purposes.
(FWIW, I have been using the Stylus extension in Firefox, with ad-hoc local CSS overrides, to overlay my work-in-progress font on various African-language Wikipedias, e.g., ee.wikipedia.org, dag.wikipedia.org, ff.wikipedia.org, etc., which is good for judging overall appearance but is not a systematic evaluation. 😉)
-
Octo-text is a good starting point and we should keep updating it.
There is also the corpus crawler data Google collected a few years ago (https://github.com/google/corpuscrawler), which includes files of words and their frequencies.
In some cases it’s difficult to get a large sample of text in a usable digital format.
*A mod (Paul Hanslow) has edited the link so that it redirects correctly.
-
Denis, good lead, thanks. Corpus Crawler also points to the Unilex project, which looks like it has some particularly promising word list resources.
I agree that it is difficult to compile reliable corpora, especially since there is so much noise and flotsam among the crawlable digital sources out there. So any analysis requires additional manual evaluation. But this gives me some material to start with.
Thanks again.
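For anyone following the same route, here is a minimal Python sketch of how a word-frequency list might be narrowed down to words containing the hooked letters mentioned at the top of the thread. It assumes each line has the form `word<TAB>count`. That layout is an assumption, not a documented format of the Corpus Crawler or Unilex files, so check the real files first. The function names are made up for illustration.

```python
import unicodedata
from collections import Counter

# Letters of interest, taken from the original post.
TARGET_LETTERS = set("ƴɗƈɲɖ")


def parse_frequency_lines(lines):
    """Yield (word, count) pairs from lines shaped like 'word<TAB>count'.

    The tab-separated layout is an assumption; blank lines, '#' comments,
    and anything that does not match are skipped.
    """
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        parts = line.split("\t")
        if len(parts) != 2 or not parts[1].isdigit():
            continue
        # Normalize so precomposed and decomposed spellings compare equal.
        yield unicodedata.normalize("NFC", parts[0]), int(parts[1])


def words_with_targets(pairs, targets=TARGET_LETTERS):
    """Keep words containing a target letter, most frequent first.

    Words are lowercased for matching only, so capitalized forms also count.
    """
    hits = [(w, c) for w, c in pairs if targets & set(w.lower())]
    return sorted(hits, key=lambda wc: (-wc[1], wc[0]))


def letter_coverage(hits, targets=TARGET_LETTERS):
    """Sum the frequencies of the words that contain each target letter."""
    coverage = Counter()
    for word, count in hits:
        for ch in set(word.lower()) & targets:
            coverage[ch] += count
    return coverage
```

Passing an open file to `parse_frequency_lines` and the result to `words_with_targets` gives a ranked list to proof against. `letter_coverage` shows which of the target letters are still thin in the sample. The output will still need the manual weeding mentioned above.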