This repository provides the pre-processing steps needed to train our own sent2vec model, which is used in the experiments. The two language models we built are a unigram and a bigram model over the Wikipedia 2017 corpus.
## RUNNING INSTRUCTIONS

### Input files:
- `enwiki-20170820-pages-articles-multistream.xml.bz2` (the raw Wikipedia dump, see below)

### Output files:
- Plain Wikipedia text with one sentence per line, produced by `preprocess_wikitext.py`

## Pre-Processing Wikipedia Dump

### Download Wikipedia Dump
- Wikipedia dumps for the English language are provided at https://meta.wikimedia.org/wiki/Data_dump_torrents#English_Wikipedia
- In our model we used enwiki-20170820-pages-articles-multistream.xml.bz2 (14.1 GiB)
Running WikiExtractor on the dump creates several directories AA, AB, AC, ..., CH with a total size of 6.2 GB. Each directory contains 100 txt documents (except CH, which contains 82).
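The extracted corpus can be traversed with a few lines of Python. The sketch below is only an illustration and assumes the WikiExtractor output was written to a directory named `extracted/`; it walks the AA–CH subdirectories and yields every extracted file so a later step can read them.

```python
import os

# "extracted" is an assumed output directory name; adjust it to
# wherever WikiExtractor wrote the AA, AB, ..., CH subdirectories.
EXTRACTED_DIR = "extracted"

def iter_extracted_files(root=EXTRACTED_DIR):
    """Yield the paths of all extracted text files, directory by directory."""
    for subdir in sorted(os.listdir(root)):           # AA, AB, ..., CH
        subdir_path = os.path.join(root, subdir)
        if not os.path.isdir(subdir_path):
            continue
        for name in sorted(os.listdir(subdir_path)):  # the files inside each directory
            yield os.path.join(subdir_path, name)

if __name__ == "__main__":
    files = list(iter_extracted_files())
    print("Found", len(files), "extracted files")
```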
Each article begins with a doc tag such as <doc id="12" url="https://en.wikipedia.org/wiki?curid=12" title="Anarchism">. The article text also contains remarks in parentheses.
Using preprocess_wikitext.py we delete all doc tags, parentheses together with their content, and quote characters such as ' or ", leaving plain Wikipedia text. The resulting text file contains one sentence per line.
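As an illustration of the cleaning described above, here is a minimal sketch (not the actual preprocess_wikitext.py): it strips the <doc ...> tags, removes parenthesized content and quote characters, and writes one sentence per line. The naive regex-based sentence splitter and the input/output paths are assumptions.

```python
import re

# Minimal sketch of the cleaning steps; the real preprocess_wikitext.py may differ.
DOC_TAG = re.compile(r"</?doc[^>]*>")        # <doc id="12" ...> and </doc>
PARENS = re.compile(r"\([^()]*\)")           # parentheses and their content (non-nested)
QUOTES = re.compile(r"[\"']")                # ' and " characters
SENT_SPLIT = re.compile(r"(?<=[.!?])\s+")    # naive sentence boundary

def clean_file(in_path, out_path):
    with open(in_path, encoding="utf-8") as fin, \
         open(out_path, "w", encoding="utf-8") as fout:
        for line in fin:
            line = DOC_TAG.sub("", line)
            line = PARENS.sub("", line)
            line = QUOTES.sub("", line)
            line = re.sub(r"\s+", " ", line).strip()
            if not line:
                continue
            # write one sentence per line
            for sentence in SENT_SPLIT.split(line):
                if sentence:
                    fout.write(sentence.strip() + "\n")

if __name__ == "__main__":
    # placeholder paths for a single extracted file
    clean_file("extracted/AA/wiki_00", "wiki_plain.txt")
```

In practice one would run this over every extracted file and concatenate the results into a single corpus file before training the sent2vec models.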