African languages word lists

Kent Lew
Kent Lew Posts: 974
Does anyone have a lead on word collections for some of the more widely spoken African languages? — e.g. Hausa, Ewe, Kabiyè, Fulfulde/Pulaar/Pular, Serer, et al.
Word frequency lists would be ideal. Dictionary collections could work. Bare corpora would have to be massaged significantly, so something already parsed would be preferable.
I’m particularly looking for extensive examples of some of the letters with more unique perimeters, like protruding hooks — ƴɗƈɲɖ more than, say, ɓƙŋ for example.
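
For anyone wanting to do the same kind of search, here is a minimal sketch of pulling the most frequent words for each of those hooked letters out of a word frequency list. It assumes, purely for illustration, that each line holds a word and a count separated by whitespace; real lists may be laid out differently. The names `TARGET_LETTERS` and `words_by_letter` are hypothetical.

```python
from collections import defaultdict

# Letters of interest, taken from the post above.
TARGET_LETTERS = "ƴɗƈɲɖ"

def words_by_letter(lines, targets=TARGET_LETTERS, top_n=20):
    """Group words by the target letters they contain, most frequent first.

    Assumes each line looks like "<word> <count>" (whitespace-separated);
    this layout is an assumption, not a documented format.
    """
    hits = defaultdict(list)
    for line in lines:
        parts = line.split()
        if len(parts) != 2 or not parts[1].isdigit():
            continue  # skip blank lines, headers, and malformed rows
        word, count = parts[0], int(parts[1])
        lowered = word.lower()  # so capitals such as Ɗ match ɗ
        for letter in targets:
            if letter in lowered:
                hits[letter].append((count, word))
    # Sort each group by descending count and keep the top_n words.
    return {
        letter: [w for _, w in sorted(found, key=lambda pair: -pair[0])[:top_n]]
        for letter, found in hits.items()
    }
```

Passing an open file works, since the function only iterates over lines, e.g. `words_by_letter(open(path, encoding="utf-8"))`. A word that contains two target letters is listed under both.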


  • Thomas Phinney
    Thomas Phinney Posts: 2,954
    Denis is indeed your person. He has assembled/updated many resources in this area for Google in the last couple of years. @Eben Sorkin has been working more with corpora but may be able to point to word lists or dictionaries as well.
  • Indeed, check Eben's Octo-text repo (in the "Output/7 African Language Textures" folder) for some text to work with. I don't have full context on what Octopus is (maybe internal to Google?), but the data in the repo is helpful nonetheless.
  • Kent Lew
    Kent Lew Posts: 974
    Thanks for the pointer, guys. Eben’s Octo-text looks like it has some good stuff in it. I’ve never heard of Octopus software. But I’ll dig in and see what I can make use of for my purposes.
    (FWIW, I have been using the Stylus extension in Firefox, with ad-hoc local CSS overrides, to overlay my work-in-progress font on various African-language Wikipedias, which is good for overall appearance, but not a systematic evaluation. 😉)

  • Denis Moyogo Jacquerye
    edited January 11
    Octo-text is a good starting point and we should keep updating it.
    There is also corpus crawler data Google collected a few years ago, with files of words and their frequencies.

    In some cases it’s difficult to get a large sample of text in digital format that can be used.

    *a Mod (Paul Hanslow) has edited the link for it to redirect correctly
  • Kent Lew
    Kent Lew Posts: 974
    Denis, good lead, thanks. Corpuscrawler also points to the Unilex project, which looks like it has some particularly promising word list resources.
    I agree that there is difficulty in compiling reliable corpora, especially since there is so much noise and flotsam among crawlable digital sources out there.
    So any analysis requires additional manual evaluation. But this gives me some material to start with.
    Thanks again.
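
As a rough illustration of a first pass before that manual evaluation, the sketch below drops tokens that contain anything other than letters, combining marks (tone marks are often encoded that way), apostrophes, or hyphens, along with tokens seen fewer than a chosen number of times. The function names, the allowed extra characters, and the `min_count` threshold are all assumptions for illustration; this only thins out obvious flotsam such as URLs and digits and does not replace checking by hand.

```python
import unicodedata

def looks_like_word(token, extra_chars="'’-"):
    """Return True if the token is made of letters, combining marks,
    and the allowed extra characters, with at least one letter."""
    has_letter = False
    for ch in token:
        category = unicodedata.category(ch)
        if category.startswith("L"):      # any kind of letter
            has_letter = True
        elif category.startswith("M"):    # combining marks (e.g. tone marks)
            continue
        elif ch not in extra_chars:       # digits, punctuation, symbols, ...
            return False
    return has_letter

def prefilter(counts, min_count=2):
    """Keep word-like entries that occur at least min_count times.

    `counts` is assumed to be a dict mapping word -> frequency.
    """
    return {
        word: count
        for word, count in counts.items()
        if count >= min_count and looks_like_word(word)
    }
```

Raising `min_count` trades coverage for cleanliness: rare but genuine words get dropped together with one-off typos.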