Digital Scholarship Labs

Word2vec is an unsupervised learning algorithm that produces high-dimensional word embeddings. Its defining characteristic is that words appearing in similar contexts end up close together in the vector space.
Furthermore, the distances between words can be generalized to produce educated guesses for analogies such as: man is to woman as king is to ? (queen).
These two methods can be explored here on various corpora. You can request a new corpus: Mail KBLabs
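
The two methods can be sketched in a few lines of Python. This is only an illustration, not the code behind this page: the three-dimensional vectors are invented by hand (real models use 200-300 dimensions learned from a corpus), and the function names `nearest_words` and `analogy` are hypothetical. Closeness is measured with cosine similarity, which is the usual choice for word2vec vectors.

```python
import math

# Toy vectors invented for illustration; a trained model would supply these.
VECTORS = {
    "man":   [1.0, 0.0, 0.2],
    "woman": [1.0, 1.0, 0.2],
    "king":  [1.0, 0.0, 1.0],
    "queen": [1.0, 1.0, 1.0],
    "apple": [-1.0, 0.3, -0.5],
}

def cosine(a, b):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def nearest(target, vectors, exclude=(), top_n=3):
    """Words whose vectors are most similar to `target`, best first."""
    scored = [(cosine(target, vec), word)
              for word, vec in vectors.items() if word not in exclude]
    scored.sort(reverse=True)
    return [word for _, word in scored[:top_n]]

def nearest_words(word, vectors, top_n=3):
    """Nearest words: neighbours of a single word, excluding the word itself."""
    return nearest(vectors[word], vectors, exclude={word}, top_n=top_n)

def analogy(a, b, c, vectors):
    """Analogy: a is to b as c is to ?  Uses vector(b) - vector(a) + vector(c)."""
    target = [vb - va + vc
              for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    # The three input words are excluded so they cannot be returned as the answer.
    return nearest(target, vectors, exclude={a, b, c}, top_n=1)[0]

print(nearest_words("king", VECTORS))
print(analogy("man", "woman", "king", VECTORS))
```

With a small corpus the learned vectors are noisy, so the offset `vector(b) - vector(a)` is less reliable; this is why the notes below report weaker analogies for the smaller corpora.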

About the corpora:

Corpus:	65,000 Gutenberg e-books
Size:	>100M words
Note:	Slow because the dictionary contains over 1M words. Analogies work very well thanks to the corpus size.
Language:	35,000 English books and 30,000 books in more than 50 different languages
Word2vec implementation:	Google
Word2vec options:	N-gram, 300-dimensions, minWordFrequency=10, windowSize=5, iterations=10

Corpus:	The Lord of the Rings trilogy
Size:	400K words
Note:	Fast, but analogy is not very good due to the small corpus
Language:	English
Word2vec implementation:	Google
Word2vec options:	CBOW, 200-dimensions, minWordFrequency=5, windowSize=6, iterations=10

Corpus:	Folketingstidende 1960-2009
Size:	1.2GB text (96K words in dictionary)
Note:	Fast, but analogy is not very good due to the small corpus
Language:	Danish
Word2vec implementation:	Google
Word2vec options:	CBOW, 300-dimensions, minWordFrequency=20, windowSize=6, iterations=10

Corpus:	20 million Danish newspaper pages from 1900 to 2016
Size:	109GB raw text (20 million newspaper pages)
Note:	The corpus is an example of how to post-fix bad OCR. The OCR of the newspapers is far from perfect, and the word2vec algorithm can detect the same word under different misspellings because the misspelled forms appear in the same contexts. So instead of finding similar words, this corpus mostly finds the same word with different misspellings.
Language:	Danish
Word2vec implementation:	Google
Word2vec options:	N-gram, 300-dimensions, minWordFrequency=100, windowSize=5, iterations=5
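
One way to make use of this effect is to check which nearest words are also close in spelling to the query word. The sketch below is an assumption about how such a filter could look, not something this page does: the helper `split_neighbours`, the 0.75 threshold, and the example neighbour list (invented OCR variants of the Danish word "regering") are all hypothetical. String similarity comes from Python's standard `difflib` module.

```python
from difflib import SequenceMatcher

def split_neighbours(query, neighbours, threshold=0.75):
    """Split nearest words into likely spelling variants of `query` and other words.

    The threshold is an arbitrary illustrative value and would need tuning
    against real OCR output.
    """
    variants, others = [], []
    for word in neighbours:
        ratio = SequenceMatcher(None, query, word).ratio()
        if ratio >= threshold:
            variants.append(word)
        else:
            others.append(word)
    return variants, others

# Hypothetical nearest-word output for "regering"; not taken from the real model.
neighbours = ["regeriug", "rcgering", "regermg", "ministerium"]
variants, others = split_neighbours("regering", neighbours)
print("Likely OCR variants:", variants)
print("Other related words:", others)
```

The variants found this way could then be mapped back to the correctly spelled word when cleaning the text.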

Corpus:	The full N.F.S. Grundtvig corpus consisting of approx. 1000 publications (1804-1877)
Size:	40MB raw text (37,000 print pages)
Note:	The corpus is part of a work in progress hosted at the Grundtvig Study Centre (AU). As such it is split in two: one part furnished with XML markup (1/3) and one part raw OCR text (2/3). For further information: Grundtvig Centeret
Language:	Danish
Word2vec implementation:	Google
Word2vec options:	CBOW, 300-dimensions, minWordFrequency=25, windowSize=6, iterations=10

Corpus:	Pride and Prejudice, Sense and Sensibility, Emma
Size:	150K words
Note:	Fast, but analogy is not very good due to the small corpus
Language:	English
Word2vec implementation:	Google
Word2vec options:	N-gram, 200-dimensions, minWordFrequency=5, windowSize=5, iterations=10