Commit f7b31bb2 authored by wu

Update preprocess.py

parent b08880d5
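# imports
import stanza
import torch
from datasets import load_dataset
from gensim.models import KeyedVectors
# PreprocessedDataSet is defined elsewhere in this project (definition not shown here)

# load the pretrained skip-gram word vectors with gensim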
file_name = "data/1-billion-word-language-modeling-benchmark-r13output.word2vec.vec"
model_gensim = KeyedVectors.load_word2vec_format(file_name)
# initialize the stanza tokenizer for sentence splitting and tokenization (pip install stanza)
nlp = stanza.Pipeline(lang='en', processors='tokenize')
# load the cnn_dailymail dataset (pip install datasets)
dataset = load_dataset('cnn_dailymail', '3.0.0', split='train[:100]+validation[:20]+test[:20]')  # small subset for testing
dataset = PreprocessedDataSet(dataset, model_gensim, nlp)
# sent_vecs is not set at this point; Critic and ActorCritic need it
torch.save(dataset, 'dataset.data')
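
The script combines pretrained word vectors with tokenized sentences and notes that `sent_vecs` is still unset. This commit does not show how `sent_vecs` is meant to be computed, so the following is only a minimal sketch of one common approach, averaging the word vectors of a sentence. The function name `average_sentence_vector`, the `dim` parameter, and the toy vectors are hypothetical and are not part of the project.

```python
from typing import List, Mapping, Sequence


def average_sentence_vector(
    tokens: Sequence[str],
    vectors: Mapping[str, Sequence[float]],
    dim: int,
) -> List[float]:
    """Return the mean of the vectors of all tokens found in `vectors`.

    Hypothetical helper (assumption): the project may compute sent_vecs differently.
    Tokens missing from `vectors` are skipped; if none are found, a zero vector
    of length `dim` is returned.
    """
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return [0.0] * dim
    # zip(*known) groups the values of each dimension across all known tokens
    return [sum(column) / len(known) for column in zip(*known)]


# Toy 2-dimensional vectors standing in for the pretrained embeddings
toy_vectors = {"the": [1.0, 0.0], "cat": [0.0, 1.0]}
sentence = ["the", "cat", "sat"]  # "sat" is out of vocabulary and is skipped
vec = average_sentence_vector(sentence, toy_vectors, dim=2)
print(vec)
```

The helper relies only on membership tests (`in`) and indexing by word, both of which a gensim `KeyedVectors` object also supports, so `model_gensim` could in principle be passed in place of the toy dictionary.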