65,000 Gutenberg e-books | ||
>100M words | ||
Slow because the dictionary holds over 1M words. Analogy results are very good thanks to the corpus size. | ||
35,000 English books; 30,000 books in more than 50 different languages | ||
N-gram, 300-dimensions, minWordFrequency=10, windowSize=5, iterations=10 |
The Lord of the Rings trilogy | ||
400K words | ||
Fast, but analogy is not very good due to the small corpus | ||
English | ||
CBOW, 200-dimensions, minWordFrequency=5, windowSize=6, iterations=10 |
Folketingstidende 1960-2009 | ||
1.2GB text (96K words in dictionary) | ||
Fast, but analogy is not very good due to the small corpus | ||
Danish | ||
CBOW, 300-dimensions, minWordFrequency=20, windowSize=6, iterations=10 |
20 million Danish newspaper pages from 1900 to 2016 | ||
109GB raw text (20 million newspaper pages) | ||
This corpus is an example of how to correct bad OCR after the fact. The newspaper OCR is far from perfect, and because the different misspellings of a word appear in the same contexts, the word2vec algorithm places them close together. So instead of finding similar words, this corpus mostly finds the same word in different misspelled forms. | ||
Danish | ||
N-gram, 300-dimensions, minWordFrequency=100, windowSize=5, iterations=5 |
The full N.F.S. Grundtvig corpus consisting of approx. 1000 publications (1804-1877) | ||
40MB raw text (37,000 print pages) | ||
The corpus is part of a work in progress hosted at the Grundtvig Study Centre (AU). As such it is split in two: one part furnished with XML markup (1/3, Markup) and one part raw OCR (2/3, Text). For further information: Grundtvig Centeret | ||
Danish | ||
CBOW, 300-dimensions, minWordFrequency=25, windowSize=6, iterations=10 |
Pride and Prejudice, Sense and Sensibility, Emma | ||
150K words | ||
Fast, but analogy is not very good due to the small corpus | ||
English | ||
N-gram, 200-dimensions, minWordFrequency=5, windowSize=5, iterations=10 |
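
The comments above rate each corpus by how well "analogy" works. As a rough illustration of what that means, here is a minimal sketch of the usual vector-arithmetic analogy test ("man is to king as woman is to ?"). The three-dimensional vectors and the helper names (`TOY_VECTORS`, `cosine`, `most_similar`, `analogy`) are hypothetical and hand-made for this example; the models listed in the table learn 200 or 300 dimensions from the corpus itself.

```python
import math

# Hypothetical hand-made vectors with the axes (royalty, maleness, femaleness).
# A trained model would learn its 200 or 300 dimensions from the corpus.
TOY_VECTORS = {
    "king":  [1.0, 1.0, 0.0],
    "queen": [1.0, 0.0, 1.0],
    "man":   [0.0, 1.0, 0.0],
    "woman": [0.0, 0.0, 1.0],
    "apple": [0.1, 0.1, 0.1],
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def most_similar(target, vectors, exclude=()):
    """Return the word whose vector has the highest cosine similarity to target."""
    candidates = (word for word in vectors if word not in exclude)
    return max(candidates, key=lambda word: cosine(vectors[word], target))


def analogy(a, b, c, vectors=TOY_VECTORS):
    """Answer 'a is to b as c is to ?' by searching near vec(b) - vec(a) + vec(c)."""
    target = [vb - va + vc
              for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    # The three query words are excluded so the answer is a new word.
    return most_similar(target, vectors, exclude={a, b, c})


print(analogy("man", "king", "woman"))  # prints "queen"
```

With a small corpus the learned vectors are noisy, so the nearest word to the target is often wrong, which is consistent with the "analogy is not very good" remarks for the smaller corpora. The same nearest-neighbour lookup is what would surface OCR misspellings in the newspaper corpus, since variants of one word end up with nearly identical vectors.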