Open lexical databases
Below is a directory of open lexical databases. Click on the name of any database to open its README file, which gives more information and links to the datasets.
Usage
- Most datasets are provided as .tsv or .csv files (tab-separated or comma-separated values). These are plain text files that can easily be imported into R or Python, or even opened with Excel. Check out our script examples.
In R or Python, you can directly download datasets from the links provided in the README file. For example:
- in Python:

    import pandas as pd
    lex = pd.read_csv('http://www.lexique.org/databases/Lexique383/Lexique383.tsv', sep='\t')
    lex.head()

- in R:

    library(readr)
    lex = read_tsv('http://www.lexique.org/databases/Lexique383/Lexique383.tsv')
    head(lex)
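
Because these are plain text files, they can also be read without pandas. The sketch below uses only Python's standard csv module on a small inline sample; the sample rows, their values, and the helper name read_tsv are made up for illustration, and the column names are assumed to resemble those of Lexique383.

```python
import csv
import io


def read_tsv(stream):
    """Parse tab-separated text with a header row into a list of dicts."""
    return list(csv.DictReader(stream, delimiter="\t"))


# Toy sample standing in for a downloaded file. The rows and values are
# invented for illustration; they are not taken from any real dataset.
sample = "ortho\tcgram\tfreqfilms2\nchat\tNOM\t12.5\nmanger\tVER\t30.0\n"
rows = read_tsv(io.StringIO(sample))

# Every field is read as a string, so numeric columns need explicit conversion.
frequent = [row["ortho"] for row in rows if float(row["freqfilms2"]) > 20]
print(frequent)
```

With a real file, the same function can be given an open file handle instead of the in-memory sample.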
In R, we recommend using the R dataset fetcher rather than a direct download, because:
- it avoids having to specify the location of the dataset on the web
- it will always point to the latest version of a dataset if it has been updated
- it provides a caching mechanism: the dataset will be downloaded only if necessary, otherwise a local copy will be used.
- it checks the checksum of the dataset to make sure that you have the correct version.
For example, to download the table of Lexique383:
require(tidyverse)
require(rjson)
source('https://raw.githubusercontent.com/chrplr/openlexicon/master/datasets-info/fetch_datasets.R')
lexique383 <- get_lexique383()
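
The fetcher above is an R tool. For Python users, the caching and checksum ideas it relies on can be approximated with a short helper. This is only a sketch under stated assumptions: fetch_cached and md5sum are hypothetical names that are not part of openlexicon, the use of MD5 is an assumption about the checksum format, and the helper does not resolve the latest version of a dataset.

```python
import hashlib
import urllib.request
from pathlib import Path


def md5sum(path):
    """Return the hexadecimal MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def fetch_cached(url, cache_dir, expected_md5=None):
    """Download `url` into `cache_dir` unless a valid local copy exists.

    Hypothetical helper written for illustration; not part of openlexicon.
    """
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    local = cache_dir / url.rsplit("/", 1)[-1]

    # Reuse the cached copy when it exists and, if a checksum was given,
    # when its digest matches.
    if local.exists() and (expected_md5 is None or md5sum(local) == expected_md5):
        return local

    with urllib.request.urlopen(url) as response, open(local, "wb") as out:
        out.write(response.read())

    # Reject a download whose digest differs from the expected one, and
    # remove it so that it is never served from the cache later.
    if expected_md5 is not None and md5sum(local) != expected_md5:
        local.unlink()
        raise ValueError(f"checksum mismatch for {url}")
    return local
```

Calling fetch_cached with the Lexique383 URL shown earlier and a cache directory of your choice returns a local path that can be passed to pd.read_csv with sep='\t'; a second call reuses the local copy instead of downloading again.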
- Many of these databases can also be explored or queried on-line at http://www.lexique.org/shiny/openlexicon, thanks to shiny apps from openlexicon.
- Most databases have associated publications listed in their respective README files. They should be cited in any derivative work!
French
Database | Description |
---|---|
Lexique3 | Lexique3 is a French lexical database that provides, for ~140000 French words: orthographic and phonemic representations, associated lemmas, syllabification, grammatical category, gender and number, frequencies in a corpus of books and in a corpus of film subtitles, etc. |
Anagrammes | Anagrammes lists more than 25000 sets of French anagrams. |
Voisins | Voisins lists the orthographic neighbors obtained by substituting one letter, for 130000 French words. |
French Lexicon Project | The French Lexicon Project (FLP) was inspired by the English Lexicon Project (Balota et al. 2007). It provides visual lexical decision times for about 39000 French words and as many pseudowords. The full data set comprises 1942000 reaction times from 975 participants. |
Megalex | Megalex provides visual and auditory lexical decision times and accuracy rates for several thousand words: visual lexical decision data are available for 28466 French words and the same number of pseudowords, and auditory lexical decision data are available for 17876 French words and the same number of pseudowords. |
Chronolex | Chronolex provides naming times, lexical decision times and progressive demasking scores for most monosyllabic monomorphemic French words (about 1500 items). Thirty-seven participants were tested in the naming task, 35 additional participants in the lexical decision task, and 33 additional participants in the progressive demasking task. |
SILEX | Silex is a database designed to facilitate the study of spelling performance in general, and silent-letter endings in particular. |
Brulex | Brulex provides, for about 36,000 French words, the spelling, pronunciation, grammatical class, gender, number, and usage frequency. It also contains other information useful for selecting experimental materials (notably uniqueness point, lexical neighbor counts, phonological patterns, and mean digram frequency). |
Gougenheim100 | Gougenheim100 gives, for 1064 words, their frequency and their dispersion (the number of texts in which they appear). It is based on a spoken-language corpus built from a set of interviews with 275 people, so it is a corpus not only of spoken language but also of produced language. The original corpus comprises 163 texts, 312,135 words, and 7,995 distinct lemmas. |
Chacqfam | CHACQFAM is a database providing the estimated age of acquisition and the familiarity of 1225 French words. |
Frantext | Frantext provides the list of all orthographic types obtained after tokenization of the Frantext subcorpus used to compute the “books” (“livres”) frequencies of Lexique. |
francais-GUTenberg | List of 336531 French words derived from the Français-GUTenberg ispell dictionary. |
Morphalou | Wide-coverage lexicon of modern French, comprising 159,271 lemmas and 976,570 inflected forms. |
Morpholex-fr | Lexical database for ~38k French words with morphological variables. |
Fr- Familiary660 | Familiarity ratings for 660 words, estimated by young adults and older adults. |
SemantiQc | These databases provide conceptual familiarity and auditory and visual perceptual strength ratings for 3596 French words, collected from 304 French-speaking adults in Quebec. |
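
Two of the entries above, Anagrammes and Voisins, describe relations between words that are easy to illustrate in code. The following is a minimal sketch of how anagram sets and one-letter substitution neighbors can be computed from a word list. It is not the method used to build those databases; the function names and the short word list are chosen for illustration only.

```python
from collections import defaultdict


def anagram_sets(words):
    """Group words made of exactly the same letters; keep groups of 2 or more."""
    groups = defaultdict(list)
    for word in words:
        groups["".join(sorted(word))].append(word)
    return [sorted(group) for group in groups.values() if len(group) > 1]


def substitution_neighbors(words):
    """Map each word to the words that differ from it by one substituted letter."""
    words = set(words)
    # Index words by every pattern obtained by masking one position.
    # Assumes that no word contains the mask character "_".
    by_pattern = defaultdict(set)
    for word in words:
        for i in range(len(word)):
            by_pattern[word[:i] + "_" + word[i + 1:]].add(word)
    neighbors = {word: set() for word in words}
    for group in by_pattern.values():
        for word in group:
            neighbors[word] |= group - {word}
    return neighbors


# Illustrative word list, not taken from any of the databases.
words = ["chien", "niche", "chine", "bain", "pain", "main", "paix"]
print(anagram_sets(words))
print(sorted(substitution_neighbors(words)["pain"]))
```

Indexing by masked patterns avoids comparing every pair of words, which matters for lists of the size reported above.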
English (American and British)
Database | Description |
---|---|
SUBTLEX-US | SUBTLEXus (Brysbaert, New & Keuleers, 2012) provides two frequency measures based on American movie subtitles (51 million words in total): (a) the frequency per million words, called SUBTLEXWF (word form frequency), and (b) the percentage of films in which a word occurs, called SUBTLEXCD (contextual diversity). |
British Lexicon Project | The British Lexicon Project (Keuleers et al., 2012) contains lexical decision data for over 28,000 monosyllabic and disyllabic English words. |
English Lexicon Project | The English Lexicon Project provides a standardized behavioral and descriptive data set for 40,481 words and 40,481 nonwords. Data from 816 participants across six universities were collected in a lexical decision task (approximately 3400 responses per participant), and data from 444 participants were collected in a speeded naming task (approximately 2500 responses per participant). |
Morpholex-en | Lexical database for ~70k English words with morphological variables. |
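
The two kinds of measure described for SUBTLEX-US, frequency per million words and the percentage of films in which a word occurs, can be illustrated on a toy corpus. This sketch only shows the arithmetic; it is not the SUBTLEX-US pipeline, and the function name and the four tiny "films" are invented for the example.

```python
from collections import Counter


def frequency_measures(documents):
    """Return {word: (frequency per million tokens, % of documents containing it)}."""
    total_tokens = sum(len(doc) for doc in documents)
    token_counts = Counter(tok for doc in documents for tok in doc)
    # Count each word at most once per document for contextual diversity.
    doc_counts = Counter(tok for doc in documents for tok in set(doc))
    return {
        word: (
            count * 1_000_000 / total_tokens,
            100 * doc_counts[word] / len(documents),
        )
        for word, count in token_counts.items()
    }


# Invented toy corpus: four "films", each a list of tokens (10 tokens in total).
films = [["the", "cat", "the"], ["the", "dog"], ["a", "dog"], ["a", "bird", "sings"]]
measures = frequency_measures(films)
print(measures["the"])  # 3 of 10 tokens, present in 2 of 4 films
```

Note that "the" and "dog" appear in the same share of films here even though their token frequencies differ, which is the distinction the two measures are meant to capture.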
Chinese
Database | Description |
---|---|
SUBTLEX-CH | SUBTLEX-CH (Cai & Brysbaert 2010) is a database of Chinese word and character frequencies based on a corpus of film and television subtitles (46.8 million characters, 33.5 million words). |
Multilingual
Database | Description |
---|---|
WorldLex | Worldlex provides word frequencies estimated from web pages collected in 66 languages. |
AoA-32lang | AoA-32lang presents a set of subjective Age of Acquisition (AoA) ratings for 299 words (158 nouns, 141 verbs) in 32 languages. |
Similar lists or resources
- Marc Brysbaert’s web site at http://crr.ugent.be/programs-data
- Meiryum Al’s Best 25 Datasets for Natural Language Processing
Contributing
If you want to contribute, check out the OpenLexicon project.
Time-stamp: <2019-05-01 11:24:52 christophe@pallier.org>