From 51555d2236c099600fd166e6aa89cfdeb148ca84 Mon Sep 17 00:00:00 2001
From: toyota <toyota@cl.uni-heidelberg.de>
Date: Fri, 30 Mar 2018 19:32:00 +0200
Subject: [PATCH] fix typos

---
 lib/README.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/lib/README.md b/lib/README.md
index 1cbec2b..4cf7703 100644
--- a/lib/README.md
+++ b/lib/README.md
@@ -15,7 +15,7 @@ Dependencies:
 - fasttext: https://github.com/facebookresearch/fastText
 - sent2vec: https://github.com/epfml/sent2vec
 
-First the wikipedia text needs to be extracted from the provided XML.
+First the Wikipedia text needs to be extracted from the provided XML.
 -extracted file: enwiki-20170820-pages-articles-multistream.xml (21.0GB)
 
 From the XML the plain text will be extracted using wikiExtractor:
@@ -54,7 +54,7 @@ For Uni-grams:
 For Bi-grams:
 ./fasttext sent2vec -input /proj/toyota/all_plain_texts.txt -output /proj/toyota/wiki_model_bigram -minCount 1 -dim 700 -epoch 10 -lr 0.2 -wordNgrams 2 -loss ns -neg 10 -thread 20 -t 0.000005 -dropoutK 4 -minCountLabel 20 -bucket 4000000
 
-In our case it will make a model over 321 million words and containing 4518148 number of words.
+In our case it will make a model over 321 million words and a vocabulary containing 4518148 words.
 
 ### Output models:
 Both models are provided on /proj/toyota on the server of the Institute of Computer Linguistics Heidelberg.
--
GitLab
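
As a supplementary sketch, separate from the patch itself: the bigram command in the patched README trains a 700-dimensional sent2vec model. Loading and querying such a model with the Python bindings from the epfml/sent2vec repository might look roughly like this; the filename wiki_model_bigram.bin is an assumption, derived from the -output path in the README plus the .bin extension fasttext appends when saving.

import sent2vec

# Assumed model file: the -output path from the training command,
# with the ".bin" suffix fasttext adds on save.
model = sent2vec.Sent2vecModel()
model.load_model('wiki_model_bigram.bin')

# embed_sentences returns a numpy array of shape (n_sentences, 700),
# matching the -dim 700 used at training time.
embeddings = model.embed_sentences(["first example sentence", "another example"])
print(embeddings.shape)

For batch use without Python, the fasttext binary built from the sent2vec fork can also print embeddings directly, e.g. via its print-sentence-vectors mode reading sentences from stdin.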