This is an implementation to provide necessary pre-processing steps for modeling an own sent2vec model which is used in the experiments. The two language models we built are a uni-gram and a bi-gram model over the wikipedia 2017 corpus.
This is an implementation to provide necessary preprocessing steps for modeling an own sent2vec model which is used in the experiments. The two language models we built are a uni-gram and a bi-gram model over the wikipedia 2017 corpus.
## RUNNING INSTRUCTIONS
## Pre-Processing Wikipedia Dump
## Preprocessing Wikipedia Dump
Download Wikipedia Dump
- Wikipedia Dumps for the english language is provided on https://meta.wikimedia.org/wiki/Data_dump_torrents#English_Wikipedia