This is an implementation to provide preprocessed data for our Word Sense Disambiguation Method 1. The script produces JSON files for SensEval-2 and SensEval-3. These files contain sentence-split lists in which each lemmatized, lowercased word is stored in a tuple together with its corresponding WordNet 3.0 POS tag.
The output is two JSON files with preprocessed data, one from the SensEval-2 dataset and one from the SensEval-3 dataset.
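
As a rough illustration of the output shape described above, the following sketch writes sentence-split lists of (lemma, POS) pairs to a JSON file and reads them back. The sample sentences and the file name `senseval2_example.json` are made up for illustration and are not the script's actual output. JSON has no tuple type, so the pairs come back as two-element lists.

```python
import json
import os
import tempfile

# Hypothetical sample: one inner list per sentence, each item a
# (lemmatized lowercased word, WordNet POS tag) tuple.
sentences = [
    [("art", "n"), ("change", "v")],
    [("bell", "n"), ("ring", "v"), ("loudly", "r")],
]

# Illustrative file name, not the name used by the real script.
path = os.path.join(tempfile.mkdtemp(), "senseval2_example.json")

with open(path, "w", encoding="utf-8") as f:
    json.dump(sentences, f)

with open(path, encoding="utf-8") as f:
    loaded = json.load(f)

# JSON stores tuples as arrays, so convert back if tuples are needed.
restored = [[tuple(pair) for pair in sentence] for sentence in loaded]
```

A consumer that needs hashable pairs (for example as dictionary keys) would use `restored` rather than `loaded`.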
# Senseval Preprocessing for Method 2
This is an implementation to provide preprocessed data for our Word Sense Disambiguation Method 2. The script produces one pkl file for each document in Senseval-2/3, named after the document.
From the provided Senseval English all-words test data and their Penn Treebank annotations, only the useful information is kept. Lemmas that are not included in the gloss mappings, or that are listed in the stopwords, are deleted. For multiword expressions, only the tag of the head token is saved; information about their satellites is discarded.
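
The filtering and saving steps described above could look roughly like the sketch below. The helper names `filter_tokens` and `save_document`, the sample lemmas, the placeholder tag values, and the document name `d00` are all assumptions made for illustration; the real script and the format of `gloss_mapping.txt` may differ.

```python
import os
import pickle
import tempfile


def filter_tokens(tokens, gloss_lemmas, stopwords):
    """Keep (lemma, tag) pairs whose lemma has a gloss mapping and is not a stopword."""
    return [
        (lemma, tag)
        for lemma, tag in tokens
        if lemma in gloss_lemmas and lemma not in stopwords
    ]


def save_document(doc_name, tokens, out_dir):
    """Pickle one document's token list to <out_dir>/<doc_name>.pkl."""
    path = os.path.join(out_dir, doc_name + ".pkl")
    with open(path, "wb") as f:
        pickle.dump(tokens, f)
    return path


# Hypothetical inputs; the tag values are placeholders.
gloss_lemmas = {"art", "change", "bell"}
stopwords = {"the", "be"}
tokens = [
    ("the", "TAG_A"),
    ("art", "TAG_B"),
    ("be", "TAG_C"),
    ("ring", "TAG_D"),    # dropped: no gloss mapping in this sample
    ("change", "TAG_E"),
]

kept = filter_tokens(tokens, gloss_lemmas, stopwords)
out_path = save_document("d00", kept, tempfile.mkdtemp())
```

Using sets for the gloss lemmas and stopwords keeps each membership check constant-time, which matters when every token of every document is tested.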
...
...
@@ -29,16 +34,21 @@ gloss_mapping.txt
stopwords.txt
- includes stopwords, which will be filtered out
Python3 scripts
- senseval_preprocessing.py
- preprocess_senseval_method1.py
## Dependencies
re - for regular expression matching
json - for saving the results for WSD method 1
pickle - for saving the resulting lists in a pkl-file for WSD method 2
nltk - WordNetLemmatizer from NLTK for lemmatizing
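
NLTK's `WordNetLemmatizer.lemmatize` takes a WordNet POS letter as its second argument, so Penn Treebank tags have to be converted before lemmatizing. The sketch below shows one common way to do that conversion. The helper name `penn_to_wordnet` and the choice to fall back to noun are assumptions and are not taken from the scripts listed above.

```python
def penn_to_wordnet(penn_tag):
    """Map a Penn Treebank tag to a WordNet POS letter (assumed fallback: noun)."""
    if penn_tag.startswith("J"):
        return "a"  # adjective
    if penn_tag.startswith("V"):
        return "v"  # verb
    if penn_tag.startswith("R"):
        return "r"  # adverb
    return "n"      # noun, and everything else in this sketch


# With NLTK and its WordNet data installed, the result could be used as:
#   from nltk.stem import WordNetLemmatizer
#   WordNetLemmatizer().lemmatize(word.lower(), pos=penn_to_wordnet(tag))
tagged = [("Bells", "NNS"), ("ringing", "VBG"), ("loudly", "RB"), ("old", "JJ")]
converted = [(word.lower(), penn_to_wordnet(tag)) for word, tag in tagged]
```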