Letter frequency for letters and their neighbours

Wei Huang · February 2016

I'm looking for a resource that shows letters and their most common immediate neighbours. Does anyone know one for Latin script languages? I know the COD provides some pairs for diacritic glyphs, and that Latin+ has some info on 'leftish' and 'rightish' neighbours, meaning glyphs that merely sit to the left or right of a letter.

Edit: I realised what I'm looking for are lists of bigrams by frequency for each letter.

Thierry Blancpain · February 2016

Didn’t Lucas de Groot have a script like that built for himself a long time ago? I think he showed it in a talk in Zurich around 2009 or so.

Wei Huang · February 2016

Ok it's a pretty simple task. Some quick research after finding out what to look up returned the following:

http://stackoverflow.com/questions/14168601/nltk-makes-it-easy-to-compute-bigrams-of-words-what-about-letters

http://www.indiana.edu/~clcl/Papers/LFE.pdf

http://practicalcryptography.com/media/cryptanalysis/files/english_bigrams_1.txt

Mads Wildgaard · February 2016

Don't know if this would help, but I once fiddled around with AntConc for a project. I think it has all the functions you're looking for, plus it builds lists. Quite easy to use.

Here's also something, in case you didn't find it already:
http://practicalcryptography.com/cryptanalysis/letter-frequencies-various-languages/

If I completely misunderstood, then never mind me ¯\_(ツ)_/¯

Fernando Díaz · February 2016

This thread from Typophile might help (via the Wayback Machine): https://web.archive.org/web/20140803034102/http://typophile.com/node/31399

Also:
https://web.archive.org/web/20050926082530/http://www.sudtipos.com.ar/test01.txt
https://web.archive.org/web/20050329080256/http://www.sudtipos.com.ar/test02.txt
https://web.archive.org/web/20050329081130/http://www.sudtipos.com.ar/test03.txt
https://web.archive.org/web/20101112040808/http://typophile.com/node/30960

And:
https://web.archive.org/web/20070510225322/http://just.letterror.com/ltrwiki/LetterFrequencyMeter

PabloImpallari · February 2016

Pathum's scripts
https://github.com/mooniak/textual-tools

Dave Crossland · February 2016

http://www.type-applications.com/site/lettermeter_info.php

Michael Bundscherer · February 2016

I also recently examined texts for a thesis. The results always differ depending on which texts you use (for example, scientific texts, news, medieval texts). I studied letter frequency and digraphs (double letters) with a Windows program (I ran it on a Mac with Wine).
Here is the link to the software “Wortgenerator” (freeware, available in English): www.sttmedia.de/wortgenerator-download

Wei Huang · February 2016

Thanks for the suggestions, everyone. I plan to analyse the Wikipedia dumps and will post results when I get them.

Michael Bundscherer
Interesting program. Via Google Translate I understand the app comes with frequency lists of 'syllables'? I'm trying to compile data to see, for each letter, what its most common left and right neighbours are (case sensitive). Does a digraph qualify for this? I thought a digraph always represented a single phoneme, so two letters that belong to two different sounds would not qualify, e.g. 'em' in housemeister?
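For anyone following along, here is a minimal Python sketch of that per-letter neighbour count. It assumes plain text input and that only letter-to-letter pairs inside a word should count (spaces, digits and punctuation break a pair). The function name `neighbour_counts` is made up for illustration and does not come from any of the tools linked above.

```python
from collections import Counter, defaultdict


def neighbour_counts(text):
    """Count, for every letter, which letters sit directly to its left and right.

    Case is preserved, so 'T' and 't' are tracked separately.
    """
    left = defaultdict(Counter)   # left[x][y]  = how often y appears directly before x
    right = defaultdict(Counter)  # right[x][y] = how often y appears directly after x
    for a, b in zip(text, text[1:]):
        # Assumption: a pair only counts when both characters are letters.
        if a.isalpha() and b.isalpha():
            right[a][b] += 1
            left[b][a] += 1
    return left, right


if __name__ == "__main__":
    left, right = neighbour_counts("The theme of the thesis")
    # Most common right-hand neighbours of lowercase 't'
    print(right["t"].most_common(3))
    # Most common left-hand neighbours of 'e'
    print(left["e"].most_common(3))
```

Lowercasing the text before counting would give the case-insensitive variant.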

Michael Bundscherer · February 2016

Sorry my mistake! You’re right, it has to be called: bigram or digram (not digraph).

Your approach is very interesting. I did the same, but studied the “European Convention on Human Rights” in different languages.

“Wortgenerator” has two functions: it can generate syllables/words, and it can count texts. For you, the second function, “Counter”, is the relevant one. There you can load plain text files and count them by different units, for example letters (letter frequency, with or without case distinction), 2-pairs (digrams), 3-pairs (trigrams) … real syllables, words.

When I copy the text of https://en.wikipedia.org/wiki/Typography and analyse it, I get the following result (setting: digrams, occurrence > 1%). You can save this as CSV and continue working in Excel. Is this what you are looking for?

in      500  2,4217%
th      458  2,2182%
er      399  1,9325%
he      398  1,9276%
an      364  1,7630%
ti      356  1,7242%
es      345  1,6709%
te      339  1,6419%
re      337  1,6322%
on      333  1,6128%
or      278  1,3464%
en      277  1,3416%
ty      272  1,3174%
nd      270  1,3077%
le      265  1,2835%
ng      254  1,2302%
it      248  1,2011%
ed      236  1,1430%
pe      236  1,1430%
nt      231  1,1188%
al      227  1,0994%
at      226  1,0946%
yp      223  1,0801%
ce      219  1,0607%
ra      216  1,0462%
of      215  1,0413%
is      208  1,0074%
se      207  1,0026%
Total (rows shown)  8137
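A similar list can be approximated in a few lines of Python. This is only a sketch under assumptions: it lowercases the text, counts only letter-to-letter pairs within words, and takes the share relative to all counted pairs. How Wortgenerator itself treats case, spaces and punctuation is not stated here, so the numbers need not match the table exactly. `bigram_shares` and `min_share` are hypothetical names.

```python
from collections import Counter


def bigram_shares(text, min_share=0.01):
    """Return (bigram, count, share) rows for bigrams above min_share, most frequent first."""
    text = text.lower()  # assumption: case-insensitive counting
    counts = Counter(
        a + b
        for a, b in zip(text, text[1:])
        if a.isalpha() and b.isalpha()  # assumption: pairs do not span spaces or punctuation
    )
    total = sum(counts.values())
    return [
        (bigram, n, n / total)
        for bigram, n in counts.most_common()
        if n / total > min_share
    ]


if __name__ == "__main__":
    sample = "Typography is the art and technique of arranging type."
    for bigram, n, share in bigram_shares(sample, min_share=0.03):
        print(f"{bigram}\t{n}\t{share:.4%}")
```

The tab-separated output can be pasted into a spreadsheet in the same way as the CSV export.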

I am very interested in the result of your investigation. Please keep me up to date!

Another tip: pay attention to the language and encoding settings throughout your workflow. UTF-16 has worked well for me.
