Commit 3e4e9c27 authored by toyota

fix typo

parent 1aa9b631
@@ -95,7 +95,7 @@ After running the system you'll have the output file in your project folder.
### RUN THE SYSTEM:
    git clone https://gitlab.cl.uni-heidelberg.de/semantik_project/wsi_chernenko_schindler_toyota.git
    cd bin
@@ -108,7 +108,7 @@ python3 chertoy.py /home/tatiana/Desktop/FSemantik_Projekt /test /topics.txt /re
### Other files:
* Performances_Table.pdf - a performance table with F1, RI, ARI and JI values of the baseline and 40 experiments (incl. CHERTOY) on the trial data.
* bin
@@ -120,7 +120,7 @@ The folder experiments contains an implementation of the baseline and 40 differe
* lib
The folder contains code for preprocessing the Wikipedia dataset to train our own sent2vec models for the experiments, plus a README file. Our preprocessed Wikipedia 2017 dataset and the two self-trained models of the Wikipedia 2017 dataset that we used in our sent2vec experiments are provided in /proj/toyota on the server of the Institute of Computational Linguistics, Heidelberg.
Other models that we used during our experiments can be found in the sense2vec and sent2vec repositories; a short usage sketch follows after this list.
* experiments
...
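As an illustration of how one of these pre-trained sent2vec models could be queried from Python: this is only a sketch, assuming the Python bindings from the epfml/sent2vec repository are installed; the model filename below is a placeholder, not one of the files on /proj/toyota.

    # Sketch only: load a trained sent2vec model and embed sentences.
    # The model path is a placeholder, not an actual file from /proj/toyota.
    import sent2vec

    model = sent2vec.Sent2vecModel()
    model.load_model("wiki_2017_model.bin")  # placeholder file name

    sentences = [
        "apple released a new phone",
        "the apple fell from the tree",
    ]
    embeddings = model.embed_sentences(sentences)  # one vector per sentence
    print(embeddings.shape)

The resulting sentence vectors can then be clustered to group occurrences of a target word by sense.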
@@ -8,15 +8,14 @@ This is an implementation to provide necessary pre-processing steps for modeling
Download Wikipedia Dump
- Wikipedia dumps for the English language are provided at https://meta.wikimedia.org/wiki/Data_dump_torrents#English_Wikipedia
- For our model we used enwiki-20170820-pages-articles-multistream.xml.bz2 (14.1 GiB)
Dependencies:
- wikiExtractor: http://attardi.github.io/wikiextractor
- fasttext: https://github.com/facebookresearch/fastText
- sent2vec: https://github.com/epfml/sent2vec
First, the Wikipedia text needs to be extracted from the provided XML.
- extracted file: enwiki-20170820-pages-articles-multistream.xml (21.0 GB)
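A minimal sketch of this decompression step, in case you want to do it from Python rather than with a command-line tool (the file names are the ones listed above):

    # Sketch: stream-decompress the .bz2 dump to the raw XML without
    # loading the ~21 GB result into memory.
    import bz2
    import shutil

    with bz2.open("enwiki-20170820-pages-articles-multistream.xml.bz2", "rb") as src, \
            open("enwiki-20170820-pages-articles-multistream.xml", "wb") as dst:
        shutil.copyfileobj(src, dst, length=1024 * 1024)  # copy in 1 MiB chunks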
From the XML, the plain text is extracted using WikiExtractor:
@@ -25,9 +24,9 @@ WikiExtractor.py -o OUTPUT-DIRECTORY INPUT-XML-FILE
_Example_
    WikiExtractor.py -o /wikitext enwiki-20170820-pages-articles-multistream.xml
WikiExtractor will create several directories AA, AB, AC, ..., CH with a total size of 6.2 GB. Each directory contains 100 .txt documents (except CH, which contains 82).
Each article begins with an ID tag such as <doc id="12" url="https://en.wikipedia.org/wiki?curid=12" title="Anarchism">. The text also contains comments in parentheses.
Using preprocess_wikitext.py we delete all IDs, parentheses together with their content, and quotes such as ' or " to obtain plain Wikipedia text. The output text file contains one sentence per line.
_Usage_
    python3 preprocess_wikitext.py input_directory_path output_txt_file_path
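For illustration, here is a simplified sketch of the kind of cleaning preprocess_wikitext.py performs. It is not the actual script: the regexes are assumptions, and the sentence splitting that produces one sentence per line is omitted for brevity.

    # Simplified sketch of the cleaning steps described above; NOT the actual
    # preprocess_wikitext.py. Sentence splitting is omitted for brevity.
    import os
    import re
    import sys

    def clean_line(line):
        line = re.sub(r"</?doc[^>]*>", "", line)       # drop <doc id=...> / </doc> tags
        line = re.sub(r"\([^)]*\)", "", line)          # drop parentheses with their content
        line = line.replace('"', "").replace("'", "")  # drop quotes
        return line.strip()

    def preprocess(input_dir, output_file):
        with open(output_file, "w", encoding="utf-8") as out:
            for root, _, files in os.walk(input_dir):  # walk the AA, AB, ... directories
                for name in sorted(files):
                    with open(os.path.join(root, name), encoding="utf-8") as f:
                        for line in f:
                            cleaned = clean_line(line)
                            if cleaned:
                                out.write(cleaned + "\n")

    if __name__ == "__main__":
        preprocess(sys.argv[1], sys.argv[2])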
...