* Performances_Table.pdf - a performance table with F1, RI, ARI, and JI values of the baseline and 40 experiments (incl. CHERTOY) on the trial data.
* bin
* experiments

The folder experiments contains an implementation of the baseline and 40 different experiments (incl. CHERTOY).
* lib
The folder contains code for preprocessing the Wikipedia dataset to train our own sent2vec models for the experiments, as well as a README file. Our preprocessed Wikipedia 2017 dataset and two self-trained models of the Wikipedia 2017 dataset, which we used in our experiments with sent2vec, are provided under /proj/toyota on the server of the Institute of Computational Linguistics, Heidelberg.
Other models that we used during our experiments can be found in the sense2vec and sent2vec repositories.
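As a minimal sketch of how such a model can be used (assuming the sent2vec Python bindings from the sent2vec repository; the model filename below is a placeholder, not the actual name of our files on /proj/toyota):

```python
import sent2vec

# Load a trained sent2vec model (placeholder path; substitute one of the
# self-trained Wikipedia 2017 models or a model from the sent2vec repo).
model = sent2vec.Sent2vecModel()
model.load_model('wiki_2017_unigrams.bin')

# Embed a single preprocessed sentence (one sentence per line, as produced
# by preprocess_wikitext.py) into a dense vector.
embedding = model.embed_sentence('anarchism is a political philosophy')
print(embedding.shape)  # (1, embedding_dim)
```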
WikiExtractor will create several directories AA, AB, AC, ..., CH with a total size of 6.2 GB. Each directory contains 100 .txt documents (except CH, which contains 82).
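A quick sanity check of that layout could look like the following sketch (`extracted` is a placeholder for the WikiExtractor output path):

```python
import os

# Count the documents WikiExtractor produced in each output subdirectory.
root = 'extracted'  # placeholder for the chosen output directory
for d in sorted(os.listdir(root)):
    n_docs = len(os.listdir(os.path.join(root, d)))
    # Expect 100 documents per directory (82 for CH), as described above.
    print(d, n_docs)
```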
Each article begins with an ID tag such as <doc id="12" url="https://en.wikipedia.org/wiki?curid=12" title="Anarchism">. Comments in parentheses are also included.
Using preprocess_wikitext.py, we delete all ID tags, parentheses with their content, and quotes such as ' or " to obtain plain Wikipedia text. The output text file contains one sentence per line.
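A simplified sketch of these cleanup steps (not the actual preprocess_wikitext.py, whose implementation may differ in detail):

```python
import re

def clean_wikitext(text):
    """Reduce WikiExtractor output to plain text, one sentence per line.

    A simplified sketch of the steps described above; the real
    preprocess_wikitext.py may differ in detail.
    """
    # Drop the <doc ...> ID tags and closing </doc> tags.
    text = re.sub(r'</?doc[^>]*>', '', text)
    # Drop parenthesized comments together with their content.
    text = re.sub(r'\([^)]*\)', '', text)
    # Drop single and double quote characters.
    text = text.replace("'", '').replace('"', '')
    # Naive sentence split, one sentence per line (a real pipeline would
    # typically use a proper sentence splitter).
    sentences = re.split(r'(?<=[.!?])\s+', text)
    return '\n'.join(s.strip() for s in sentences if s.strip())
```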