diff --git a/README.md b/README.md
index 6babca763705a0a938e9483c64465a3a17313665..35ba020af45b50bb7bbb1d8a647e1be599422744 100644
--- a/README.md
+++ b/README.md
@@ -95,7 +95,7 @@ After running the system you'll have the output file in your project folder.
 
 ### RUN THE SYSTEM:
 
-git clone https://gitlab.cl.uni-heidelberg.de/semantik_project/wsd_chernenko_schindler_toyota.git
+git clone https://gitlab.cl.uni-heidelberg.de/semantik_project/wsi_chernenko_schindler_toyota.git
 
 cd bin
 
@@ -108,7 +108,7 @@ python3 chertoy.py /home/tatiana/Desktop/FSemantik_Projekt /test /topics.txt /re
 
 ### Other files:
 
-* Performances_Table.pdf.pdf - a performance table with F1, RI, ARI and JI values of the baseline and 40 experiments (incl. CHERTOY) on the trial data.
+* Performances_Table.pdf - a performance table with F1, RI, ARI and JI values of the baseline and 40 experiments (incl. CHERTOY) on the trial data.
 
 * bin
 
@@ -120,7 +120,7 @@ The folder experiments contains an implementation of the baseline and 40 differe
 
 * lib
 
-The folder contains code for preprocessing Wikipedia Dataset to train own sent2vec models for the experiments and a README file. Our preprocessed Wikipedia 2017 dataset and two self-trained models of the Wikipedia 2017 dataset, that we used in our experiments with sent2vec, are provided on /proj/toyota on the server of the Institut.
+The folder contains code for preprocessing the Wikipedia dataset to train our own sent2vec models for the experiments, plus a README file. Our preprocessed Wikipedia 2017 dataset and the two sent2vec models we trained on it for our experiments are provided under /proj/toyota on the server of the Institute of Computational Linguistics in Heidelberg. Other models that we used during our experiments can be found in the sense2vec and sent2vec repositories.
 
 * experiments
 
diff --git a/lib/README.md b/lib/README.md
index b904bd559ae0b1f98b53fd9f3c6762054fb799ea..1cbec2be2289cc079eabe1088291173abbc57154 100644
--- a/lib/README.md
+++ b/lib/README.md
@@ -8,15 +8,14 @@ This is an implementation to provide necessary pre-processing steps for modeling
 
 Download Wikipedia Dump
 - Wikipedia Dumps for the english language is provided on https://meta.wikimedia.org/wiki/Data_dump_torrents#English_Wikipedia
-- In our model we used enwiki-20170820-pages-articles-multistream.xml.bz2 (14.1 GiB)
+- For our model we used enwiki-20170820-pages-articles-multistream.xml.bz2 (14.1 GiB)
 
 Dependencies:
 - wikiExtractor: http://attardi.github.io/wikiextractor
 - fasttext: https://github.com/facebookresearch/fastText
 - sent2vec: https://github.com/epfml/sent2vec
-
-First of all the wikipedia text needs to be extracted from the provided XML.
+First, the Wikipedia text needs to be extracted from the provided XML.
 
-extracted file: enwiki-20170820-pages-articles-multistream.xml (21.0GB)
 
 From the XML the plain text will be extracted using wikiExtractor:
@@ -25,9 +24,9 @@ WikiExtractor.py -o OUTPUT-DIRECTORY INPUT-XML-FILE
 _Example_
 WikiExtractor.py -o /wikitext enwiki-20170820-pages-articles-multistream.xml
 
-WikiExtractor will create several directories AA, AB, AC, ..., CH with a total size of 6.2GB. Each directory contains 100 txt documents (besides CH -> 82).
+WikiExtractor will create several directories AA, AB, AC, ..., CH with a total size of 6.2GB. Each directory contains 100 .txt documents (except CH, which contains 82).
 Each article begins with an ID such as <doc id="12" url="https://en.wikipedia.org/wiki?curid=12" title="Anarchism">. Also comments in Parentheses are provided.
-Using preprocess_wikitext.py we delete all IDs, parentheses with their content and also quotes like ' or " and getting a plain wikipedia text. The text file contain one sentence per line.
+Using preprocess_wikitext.py we delete all IDs, parentheses with their content and quotes like ' or ", yielding plain Wikipedia text. The output text file contains one sentence per line.
 
 _Usage_
 python3 preprocess_wikitext.py input_directory_path output_txt_file_path
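
For orientation, here is a minimal, illustrative sketch of the preprocessing step that the updated lib/README.md describes: strip the <doc ...> IDs, remove parenthesized comments and quote characters, and write one sentence per line. It is not the project's actual preprocess_wikitext.py; the regular expressions, the naive sentence splitter, and the assumption that WikiExtractor's output files are named wiki_* are illustrative choices only.

```python
import re
import sys
from pathlib import Path

# Illustrative sketch only -- the real preprocess_wikitext.py may differ.
DOC_TAG = re.compile(r"</?doc[^>]*>")        # <doc id="12" url="..." title="..."> and </doc>
PARENS = re.compile(r"\([^()]*\)")           # parenthesized comments (non-nested)
QUOTES = re.compile(r"[\"']")                # quote characters like ' and "
SENT_SPLIT = re.compile(r"(?<=[.!?])\s+")    # naive sentence boundary (assumption)

def clean(text: str) -> str:
    """Remove WikiExtractor markup, parentheses and quotes; collapse whitespace."""
    text = DOC_TAG.sub(" ", text)
    text = PARENS.sub(" ", text)
    text = QUOTES.sub("", text)
    return re.sub(r"\s+", " ", text).strip()

def preprocess(input_dir: str, output_file: str) -> None:
    with open(output_file, "w", encoding="utf-8") as out:
        # WikiExtractor writes files such as AA/wiki_00, AA/wiki_01, ... (assumed naming).
        for path in sorted(Path(input_dir).rglob("wiki_*")):
            raw = path.read_text(encoding="utf-8", errors="ignore")
            for sentence in SENT_SPLIT.split(clean(raw)):
                if sentence:
                    out.write(sentence + "\n")  # one sentence per line

if __name__ == "__main__":
    # Mirrors the documented call: python3 preprocess_wikitext.py input_directory_path output_txt_file_path
    preprocess(sys.argv[1], sys.argv[2])
```

Plain regular expressions keep this sketch dependency-free; a production pipeline would more likely use a proper sentence splitter before handing the text to sent2vec/fastText training.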