Access to lexical databases


WorldLex provides word frequency tables for 64 languages, estimated from web pages (blogs, Twitter, and newspapers).

The web page corpora were assembled by Hans Christensen and are available at HC-Corpora. According to that website:

The corpora are collected from publicly available sources by a web crawler. The crawler checks for language, so as to mainly get texts consisting of the desired language. Once the raw corpus has been collected, it is parsed further, to remove duplicate entries and split into individual lines. Approximately 50% of each entry is then deleted. Since you cannot fully recreate any entries, the entries are anonymised and this is a non-profit venture I believe that it would fall under Fair Use.
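As a rough illustration of the processing described in the quote above, the following Python sketch removes duplicate entries, splits each entry into lines, and discards about half of each entry. It is not the HC-Corpora code: the function name `anonymise_corpus`, the `keep_fraction` and `seed` parameters, and the choice to delete whole lines at random are all assumptions made for the example.

```python
import random


def anonymise_corpus(entries, keep_fraction=0.5, seed=0):
    """Deduplicate entries, split them into lines, keep roughly half of each.

    Hypothetical helper: the quoted description does not say how the
    deleted half is chosen, so lines are dropped at random here.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    seen = set()
    kept_lines = []
    for entry in entries:
        if entry in seen:  # skip exact duplicate entries
            continue
        seen.add(entry)
        # Split the entry into individual non-empty lines.
        lines = [line for line in entry.splitlines() if line.strip()]
        if not lines:
            continue
        # Keep about keep_fraction of the lines, and at least one.
        n_keep = max(1, round(len(lines) * keep_fraction))
        indices = sorted(rng.sample(range(len(lines)), n_keep))
        kept_lines.extend(lines[i] for i in indices)
    return kept_lines
```

Because only part of each entry survives, the original text cannot be fully reconstructed from the output, which is the property the quote relies on.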

The frequency tables were created by Manuel Gimenes and Boris New:



Gimenes, Manuel, and Boris New. 2016. Worldlex: Twitter and Blog Word Frequencies for 66 Languages. Behavior Research Methods 48 (3): 963–72.

Time-stamp: <2019-10-05 09:39:10>