Letter frequency for letters and their neighbours
Wei Huang
I'm looking for a resource that shows letters and their most common immediate neighbours. Does anyone know of one for Latin-script languages? I know the COD provides some pairs for diacritic glyphs, and that Latin+ has some info on 'leftish' and 'rightish' neighbours, meaning only that they tend to sit to the left or right of a letter.
Edit: I realised what I'm looking for are lists of bigrams by frequency for each letter.
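For anyone who wants to compute such lists themselves, here is a minimal Python sketch of the idea: for each letter, tally which letters appear directly to its left and right. The function name `neighbour_counts` and the sample sentence are made up for illustration, and the rule that only letter-to-letter pairs count (so spaces and punctuation break a pair) is an assumption, not something specified in this thread. Counting is case-sensitive.

```python
from collections import Counter, defaultdict


def neighbour_counts(text):
    """Count, for every letter, the letters found directly left and right of it."""
    left = defaultdict(Counter)   # left[c][p]: p occurs immediately before c
    right = defaultdict(Counter)  # right[c][n]: n occurs immediately after c
    for prev, nxt in zip(text, text[1:]):
        # Assumption: only letter-letter pairs count, so spaces and
        # punctuation break a pair instead of becoming a "neighbour".
        if prev.isalpha() and nxt.isalpha():
            right[prev][nxt] += 1
            left[nxt][prev] += 1
    return left, right


# Hypothetical sample text; replace with the contents of a real corpus.
sample = "Typography is the art of arranging type"
left, right = neighbour_counts(sample)
print("Most common right neighbours of 't':", right["t"].most_common(3))
print("Most common left neighbours of 'e':", left["e"].most_common(3))
```

Because nothing is lowercased, 'T' and 't' get separate lists. Lowercase the text first if case should be ignored.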
Comments
-
Didn’t Lucas de Groot have a script like that built for himself a long time ago? I think he showed it in a talk in Zurich around 2009 or so.
-
OK, it's a pretty simple task. Once I knew what to search for, some quick research turned up the following:
http://stackoverflow.com/questions/14168601/nltk-makes-it-easy-to-compute-bigrams-of-words-what-about-letters
http://www.indiana.edu/~clcl/Papers/LFE.pdf
http://practicalcryptography.com/media/cryptanalysis/files/english_bigrams_1.txt
-
Don't know if this will help, but I once fiddled around with AntConc for a project. I think it has all the functions you're after, and it builds lists too. It's quite easy to use.
Here's also something, in case you haven't found it already:
http://practicalcryptography.com/cryptanalysis/letter-frequencies-various-languages/
If I've completely misunderstood, then never mind me ¯\_(ツ)_/¯
-
This thread from Typophile might help (via waybackmachine): https://web.archive.org/web/20140803034102/http://typophile.com/node/31399
Also:
https://web.archive.org/web/20050926082530/http://www.sudtipos.com.ar/test01.txt
https://web.archive.org/web/20050329080256/http://www.sudtipos.com.ar/test02.txt
https://web.archive.org/web/20050329081130/http://www.sudtipos.com.ar/test03.txt
https://web.archive.org/web/20101112040808/http://typophile.com/node/30960
And:
https://web.archive.org/web/20070510225322/http://just.letterror.com/ltrwiki/LetterFrequencyMeter
-
I also recently examined texts for a thesis. The results always differ depending on the kind of text you use (for example scientific texts, news, or medieval texts). I studied letter frequency and digraphs (double letters) with a Windows program (I ran it on a Mac with Wine).
Here is the link to the software “Wortgenerator” (freeware, available in English): www.sttmedia.de/wortgenerator-download
-
Thanks for the suggestions, everyone. I plan to analyse the Wikipedia dumps, and I'll post the results when I have them.
Michael Bundscherer: Interesting program. From Google Translate I gather that the app comes with frequency lists of 'syllables'? I'm trying to compile data showing, for each letter, its most common left and right neighbours (case-sensitive). Does a digraph qualify for this? I thought a digraph always represents a single phoneme, so two letters that belong to two different sounds would not qualify, e.g. em in housemeister?
-
Sorry, my mistake! You’re right, the correct term is bigram or digram (not digraph).
Your approach is very interesting. I did the same thing, but studied the “European Convention on Human Rights” in different languages.
“Wortgenerator” has two functions: it can generate syllables or words, and it can count texts. For you, the second function, “Counter”, is the relevant one. There you can load plain text files and count them by different criteria, for example letters (letter frequency, case-sensitive or not), 2-letter sequences (digrams), 3-letter sequences (trigrams), … real syllables, and words.
When I copy the text of https://en.wikipedia.org/wiki/Typography and examine it, I get the following analysis (setting: digrams, occurrence > 1%). You can save this as CSV and continue working in Excel. Is this what you are looking for?
Digram  Count  Share
in      500    2,4217%
th      458    2,2182%
er      399    1,9325%
he      398    1,9276%
an      364    1,7630%
ti      356    1,7242%
es      345    1,6709%
te      339    1,6419%
re      337    1,6322%
on      333    1,6128%
or      278    1,3464%
en      277    1,3416%
ty      272    1,3174%
nd      270    1,3077%
le      265    1,2835%
ng      254    1,2302%
it      248    1,2011%
ed      236    1,1430%
pe      236    1,1430%
nt      231    1,1188%
al      227    1,0994%
at      226    1,0946%
yp      223    1,0801%
ce      219    1,0607%
ra      216    1,0462%
of      215    1,0413%
is      208    1,0074%
se      207    1,0026%
Total   8137
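For readers without Wortgenerator, a count like the one above can be approximated in a few lines of Python. This is only a sketch under assumptions: the function name `bigram_shares` and its parameters are hypothetical, and how Wortgenerator treats spaces, punctuation, and case is not documented in this thread, so the rule used here (lowercase everything, count only letter-to-letter pairs) may not reproduce its numbers exactly.

```python
import csv
import sys
from collections import Counter


def bigram_shares(text, min_share=1.0, ignore_case=True):
    """Return (bigram, count, percent) rows for bigrams above min_share percent."""
    if ignore_case:
        text = text.lower()
    # Assumption: a bigram is two adjacent letters; pairs that include a
    # space, digit, or punctuation mark are skipped.
    counts = Counter(
        a + b for a, b in zip(text, text[1:]) if a.isalpha() and b.isalpha()
    )
    total = sum(counts.values())
    if total == 0:
        return []
    # most_common() sorts by descending count, so the rows come out ranked.
    rows = [(bg, n, 100.0 * n / total) for bg, n in counts.most_common()]
    return [row for row in rows if row[2] > min_share]


# Hypothetical sample text; in practice, pass in the text of a whole article.
writer = csv.writer(sys.stdout)
writer.writerow(["bigram", "count", "percent"])
for bigram, count, percent in bigram_shares("the theme of the thesis"):
    writer.writerow([bigram, count, f"{percent:.4f}"])
```

The percentages here are relative to all letter bigrams in the text, not only to the rows that pass the threshold.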
I am very interested in the result of your investigation. Please keep me up to date!
Another tip: pay attention to the language and encoding settings throughout your workflow. For me, UTF-16 has proven reliable.
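To illustrate the encoding tip, here is a small Python sketch that reads a text file with an explicit encoding and then normalises it, so that an accented letter typed as base letter plus combining mark is counted as one letter where a precomposed form exists. The helper name `load_text` and the temporary sample file are made up for the example, and the NFC normalisation step is my own addition, not something mentioned above.

```python
import tempfile
import unicodedata
from pathlib import Path


def load_text(path, encoding="utf-16"):
    """Read a text file with an explicit encoding and normalise it to NFC."""
    raw = Path(path).read_text(encoding=encoding)
    # NFC composes a base letter and a combining mark into a single
    # precomposed character where Unicode defines one (e + U+0301 -> é).
    return unicodedata.normalize("NFC", raw)


# Demo with a throwaway file: "cafe" followed by a combining acute accent.
with tempfile.TemporaryDirectory() as tmp:
    sample_path = Path(tmp) / "sample.txt"
    sample_path.write_text("cafe\u0301", encoding="utf-16")
    text = load_text(sample_path)

print(len(text), "characters after normalisation")
```

Stating the encoding explicitly avoids depending on the platform default, which differs between systems.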